Appendix 2. AI Engineering Interview Questions
Sources & Attribution
Questions in this database are curated, verified, and cross-referenced from the following sources:
Academic & Research Papers: Vaswani et al. 2017 (Attention Is All You Need), Sennrich et al. 2016 (BPE), Devlin et al. 2019 (BERT), Radford et al. 2018/2019 (GPT/GPT-2), Brown et al. 2020 (GPT-3), Touvron et al. 2023 (LLaMA), Hu et al. 2021 (LoRA), Ouyang et al. 2022 (InstructGPT/RLHF), Rafailov et al. 2023 (DPO), Shao et al. 2024 (GRPO), Dao et al. 2022/2023 (FlashAttention 1 & 2), Hoffmann et al. 2022 (Chinchilla), Kaplan et al. 2020 (Scaling Laws), Kwon et al. 2023 (PagedAttention/vLLM), Dettmers et al. 2023 (QLoRA), Lin et al. 2024 (AWQ), Xiao et al. 2023 (SmoothQuant), Gu & Dao 2023 (Mamba), DeepSeek-AI 2024/2025 (DeepSeek V2/V3/R1).
Interview Platforms & Question Banks: MLStack.cafe (59 LLMs Interview Questions, 42 ChatGPT Interview Questions, 38 NLP Interview Questions — mlstack.cafe/interview-questions/llms), Awesome Generative AI Guide by Aishwarya Naresh Reganti (60 Common GenAI Interview Questions — github.com/aishwaryanr/awesome-generative-ai-guide/interview_prep/60_gen_ai_questions.md).
Community Forums & Reports: Reddit r/MachineLearning, r/LocalLLaMA, r/learnmachinelearning; Hacker News; Blind (FAANG interview threads); LeetCode Discuss.
Industry Guides & Blogs: Chip Huyen — “Designing Machine Learning Systems” (O’Reilly 2022) and blog (huyenchip.com); Anthropic — “Building Effective Agents” (2024, anthropic.com); OpenAI Cookbook (github.com/openai/openai-cookbook); HuggingFace NLP Course (huggingface.co/learn/nlp-course); Stanford CS324 — Large Language Models (stanford-cs324.github.io); Full Stack Deep Learning — LLM Bootcamp 2023 (fullstackdeeplearning.com); Jay Alammar — “The Illustrated Transformer” (jalammar.github.io/illustrated-transformer); Lilian Weng — “LLM Powered Autonomous Agents” (lilianweng.github.io); Sebastian Raschka — “Build a Large Language Model From Scratch” (2024).
Book Chapters (this handbook): Each section maps to the corresponding chapter of the AI Engineering Handbook (Chapters 1–8).
Questions are tagged: [conceptual] [tradeoff] [design] [debug] [coding] [estimation]
Chapter 1 — Foundations
1.1 Tokenization
- Why do modern LLMs use subword tokenization instead of word-level or character-level? [conceptual]
- Compare BPE, WordPiece, and Unigram LM tokenizers — when would you choose each? [tradeoff]
- What is byte-level BPE (BBPE) and why is it better for multilingual models? [conceptual]
- How does vocabulary size trade off compute vs quality? [tradeoff]
- Your biomedical domain terms are splitting into 8+ subtokens — what do you do, and what can go wrong? [debug]
- What is tokenizer “fragmentation rate” and why does it matter for cost and latency? [conceptual]
- A colleague swapped the tokenizer between training and serving — what breaks? [debug]
- Why can’t you directly compare perplexity across models with different tokenizers? [conceptual]
- How does tokenization affect arithmetic and numerical reasoning in LLMs? [conceptual]
- If you extend a tokenizer mid-training, what are the risks and how do you mitigate them? [tradeoff]
- What is the relationship between token count and API cost? How would you optimize for a cost-sensitive deployment? [estimation]
- How does SentencePiece differ from HuggingFace tokenizers in handling whitespace and unknown characters? [conceptual]
- You’re building a multilingual model (English, Chinese, Arabic, code). Walk through your tokenizer design decisions — vocab size, algorithm, byte fallback, special tokens. [design]
- A model that handles code well suddenly produces wrong outputs when users paste markdown with backtick fences — what’s the likely tokenizer issue? [debug]
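The token-count/API-cost estimation question above reduces to simple arithmetic. A minimal sketch, with purely illustrative prices and workload numbers (not any provider's real rates):

```python
def monthly_token_cost(requests_per_day, avg_input_tokens, avg_output_tokens,
                       price_in_per_m, price_out_per_m, days=30):
    """Back-of-the-envelope API cost: monthly tokens x price per 1M tokens."""
    in_tok = requests_per_day * avg_input_tokens * days
    out_tok = requests_per_day * avg_output_tokens * days
    return in_tok / 1e6 * price_in_per_m + out_tok / 1e6 * price_out_per_m

# Hypothetical workload: 100K requests/day, 1.5K input / 500 output tokens,
# $3 / $15 per 1M input/output tokens (illustrative prices only).
cost = monthly_token_cost(100_000, 1_500, 500, 3.0, 15.0)
print(f"${cost:,.0f}/month")  # $36,000/month
```

Note that output tokens dominate cost here despite being a third of the volume, which is why verbosity control and prompt caching are common cost levers.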
1.2 Embeddings
- Why can’t the model just use token IDs as numbers — why do we need embeddings? [conceptual]
- Explain the embedding lookup — how is it equivalent to a one-hot × matrix multiply? [conceptual]
- What is “weight tying” (input embedding = output softmax weights) and why do it? [tradeoff]
- How do word/token embeddings differ from sentence embeddings? [conceptual]
- Compare CBOW and Skip-gram — when does Skip-gram outperform? [tradeoff]
- How does FastText handle out-of-vocabulary words? [conceptual]
- What is negative sampling and why is it needed to train word2vec at scale? [conceptual]
- How would you evaluate the quality of learned embeddings? [design]
- Explain the concept of contextual embeddings. How do BERT-style models differ from static word2vec/GloVe embeddings? [conceptual]
- What is the “anisotropy problem” in embedding spaces — why do modern LLM embeddings cluster in a narrow cone, and what can you do about it? [conceptual]
- How do cross-modal embeddings (e.g., CLIP) align text and image in a shared space? What is contrastive loss? [conceptual]
- You’re choosing an embedding model for a production RAG system — what dimensions do you evaluate (dimensionality, latency, domain coverage, MTEB scores)? [design]
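The lookup-equals-matmul question above can be verified in a few lines of NumPy (toy sizes, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 10, 4
E = rng.standard_normal((vocab, d))  # embedding matrix: one row per token ID

token_id = 7
one_hot = np.zeros(vocab)
one_hot[token_id] = 1.0

# Row indexing and one-hot @ matrix yield the same vector; the lookup is
# just the O(d) shortcut for the O(vocab * d) matrix multiply.
assert np.allclose(E[token_id], one_hot @ E)
```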
1.3 Attention
- Explain scaled dot-product attention step by step. Why divide by √d_k? [conceptual]
- What is the causal mask and why is it necessary for decoder-only LLMs? [conceptual]
- Why does vanilla self-attention scale as O(L²) in sequence length? What are the practical consequences? [conceptual]
- Explain multi-head attention (MHA). Why use multiple heads instead of one big head? [conceptual]
- What is cross-attention and where is it used? [conceptual]
- Compare MHA, MQA, and GQA — how does each affect the KV cache and quality? [tradeoff]
- Explain the KV cache. Why do we need it during autoregressive decoding? [conceptual]
- How does FlashAttention improve performance without changing the attention math? [conceptual]
- What is sparse attention (e.g., sliding window)? What do you gain and lose? [tradeoff]
- How can you turn vanilla softmax attention into “linear attention”? What’s the tradeoff? [conceptual]
- Why does long-context prefill dominate time-to-first-token (TTFT)? [conceptual]
- You doubled the context window from 4K to 128K — what breaks and what gets expensive? [estimation]
- Derive the memory cost of the KV cache for a single transformer layer as a function of batch size, sequence length, number of KV heads, and head dimension. [estimation]
- What is Multi-head Latent Attention (MLA) as used in DeepSeek V2? How does it compress the KV cache more aggressively than GQA? [conceptual]
- Explain Ring Attention and its Striped Attention variant. How do they distribute long sequences across multiple GPUs for training? [conceptual]
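The per-layer KV cache derivation asked for above can be checked numerically. A minimal sketch, assuming K and V are cached in FP16 (2 bytes per element):

```python
def kv_bytes_per_layer(batch, seq_len, n_kv_heads, head_dim, bytes_per_elem=2):
    """K and V each store batch * seq_len * n_kv_heads * head_dim elements,
    hence the leading factor of 2."""
    return 2 * batch * seq_len * n_kv_heads * head_dim * bytes_per_elem

# Example: batch=1, 8K context, 8 KV heads (GQA), head_dim=128, FP16.
per_layer = kv_bytes_per_layer(1, 8192, 8, 128)
print(per_layer / 2**20, "MiB per layer")  # 32.0 MiB
```

Multiplying by the layer count gives the whole-model figure, which is why GQA (fewer KV heads) and MLA (latent compression) attack this term directly.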
1.4 FFN, Activations, Residuals, Normalization
- Why do transformers need the FFN/MLP if attention already mixes information across tokens? [conceptual]
- Compare ReLU, GELU, Swish, and SwiGLU — why are gated activations preferred in modern LLMs? [tradeoff]
- What is the “dying ReLU” problem? [conceptual]
- Why do residual (skip) connections help train deep networks? [conceptual]
- Compare pre-norm vs post-norm transformers — which is more stable at depth and why? [tradeoff]
- Compare LayerNorm and RMSNorm. Why do modern LLMs favor RMSNorm? [tradeoff]
- What is DeepNorm? When would you need it? [conceptual]
- What happens if you remove all residual connections from a 70-layer transformer? [debug]
- The FFN typically has a “hidden dimension” 4× the model dimension (or 8/3× with SwiGLU). Why this ratio? What is the memory/compute implication? [conceptual]
- What is the “residual stream” view of transformers? How does it help reason about feature circuits and superposition? [conceptual]
1.5 Positional Encoding
- Why does a transformer need positional encoding? What happens without it? [conceptual]
- Explain sinusoidal positional encoding. Why sin/cos with different frequencies? [conceptual]
- What is RoPE (Rotary Position Embedding)? Give the one-line intuition. [conceptual]
- Compare absolute, relative, RoPE, and ALiBi — what’s the one-sentence difference? [tradeoff]
- What is “length extrapolation” and why do some position schemes degrade at longer contexts? [conceptual]
- How do NTK-aware scaling and YaRN extend RoPE to longer contexts? [conceptual]
- Your model was trained at 8K context but users need 128K — what are your options? [design]
- Why do ALiBi-style models claim to need no positional embeddings at all? What are they actually doing? [conceptual]
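For the sinusoidal question above, the original Vaswani et al. formulation (PE[pos, 2i] = sin(pos/10000^(2i/d)), PE[pos, 2i+1] = cos of the same angle) is short enough to implement directly; a minimal NumPy sketch:

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Each dimension pair (2i, 2i+1) oscillates at its own frequency,
    from wavelength 2*pi up to 10000*2*pi, so positions get unique codes."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # even dims: (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(128, 64)
# Position 0 is all (sin 0, cos 0) = (0, 1) pairs, and values stay in [-1, 1].
assert np.allclose(pe[0, 0::2], 0.0) and np.allclose(pe[0, 1::2], 1.0)
```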
1.6 Decoding
- Compare greedy search, beam search, top-k, top-p (nucleus), and temperature sampling. [conceptual]
- What does temperature do mathematically? What fails at very low T vs very high T? [conceptual]
- Why can beam search reduce diversity and produce bland outputs? [conceptual]
- Top-k vs top-p — which is more robust across different prompts and why? [tradeoff]
- Explain “best-of-N” sampling. When would you use it vs single-sample generation? [tradeoff]
- What is self-consistency / majority voting and when is it useful? [conceptual]
- What are “stop sequences” and why do they matter for tool calls and structured outputs? [conceptual]
- How do repetition penalties work? What can they accidentally break? [debug]
- A user complains the model is “too repetitive” — what decoding knobs do you check first? [debug]
- Explain min_p sampling. How does it differ from top-p and why has it gained popularity in open-source inference? [conceptual]
- How does guided/constrained decoding (e.g., Outlines, LMFE) enforce a grammar or JSON schema during generation? What are the runtime costs? [conceptual]
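The temperature and top-p questions above both come down to a few lines of math over the logits. A minimal sketch (NumPy, illustrative only, not any engine's actual sampler):

```python
import numpy as np

def sample(logits, temperature=1.0, top_p=1.0, rng=np.random.default_rng(0)):
    """Temperature divides logits before softmax (low T -> greedy, high T ->
    uniform); top-p keeps the smallest set of tokens whose cumulative
    probability reaches p, then renormalizes over that nucleus."""
    z = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]            # tokens by descending prob
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]  # the nucleus
    p = np.zeros_like(probs)
    p[keep] = probs[keep]
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

tok = sample([2.0, 1.0, 0.1, -1.0], temperature=0.7, top_p=0.9)
```

min_p differs in that it keeps tokens whose probability is at least `min_p` times the top token's probability, so the cutoff adapts to how peaked the distribution is.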
1.7 Architecture
- Compare encoder-only, decoder-only, encoder-decoder, and PrefixLM architectures. [conceptual]
- Why are most modern chat LLMs decoder-only? [conceptual]
- When would you prefer encoder-only (BERT-style) over decoder-only? [tradeoff]
- What is Mixture of Experts (MoE)? Why does routing matter? [conceptual]
- In MoE, what is load balancing and why can it fail? [debug]
- What architectural choice tends to dominate inference memory at decode time? [conceptual]
- Give an overview of state-space models (Mamba). What bottleneck do they target? [conceptual]
- Compare transformers vs SSMs vs recurrent hybrids (RWKV) for long-context tasks. [tradeoff]
- What are diffusion language models and when might they outperform autoregressive generation? [conceptual]
- Explain the “modern decoder recipe”: RoPE + RMSNorm + SwiGLU + GQA. Why do LLaMA, Mistral, Qwen, and most open-source models converge on this pattern? [conceptual]
- What is “parallel attention + FFN” as used in PaLM? What is the compute saving vs the quality tradeoff? [tradeoff]
- What is Multi-Token Prediction (MTP) as used in DeepSeek V3? How does predicting multiple future tokens improve training signal and enable speculative decoding? [conceptual]
Chapter 2 — Pretraining
2.1 Data Pipeline & Quality
- Walk through a pretraining data pipeline: sources → filtering → dedup → tokenization → mixture. [design]
- Why is “data quality > data quantity” the most important pretraining principle? [conceptual]
- What filters would you apply to a Common Crawl dataset before pretraining? [design]
- Explain train-train deduplication vs train-test contamination — why does each matter? [conceptual]
- How would you build a pipeline that detects and prevents benchmark contamination? [design]
- You found that your model memorized chunks of training data verbatim — how do you diagnose and fix this? [debug]
- What data governance practices matter for pretraining (licensing, PII, audit trails)? [design]
- How do you handle PII in training data at scale? [design]
- What are common deduplication strategies (exact hash, MinHash, embedding similarity)? [conceptual]
- You need to create a classifier that filters toxic content from your pretraining corpus. What architecture do you use, and how do you handle false positives removing valuable data? [design]
- What is the role of “quality classifiers” (e.g., trained on Wikipedia vs random web) in pretraining data pipelines? How did Llama 3 approach this? [conceptual]
2.2 Data Mixture & Distribution
- Why does the training data mixture matter? Give an example where adding more code data hurts chat quality. [tradeoff]
- How would you design a data mixture for a general-purpose LLM? [design]
- You added domain-specific data and domain performance improved but general benchmarks dropped — what happened? [debug]
- What is “curriculum learning” in the context of pretraining? [conceptual]
- How do you estimate the marginal value of adding a new data source to the mixture? What proxy metrics can you track before running a full pretrain? [design]
2.3 Compute & Scaling
- What are the main cost drivers of pretraining (parameters, context, tokens, precision)? [conceptual]
- Explain the Chinchilla scaling laws — what is the compute-optimal ratio of parameters to tokens? [conceptual]
- Back-of-the-envelope: how many FLOPs to train a 70B model on 2T tokens? [estimation]
- Doubling model parameters roughly does what to FLOPs per token? [estimation]
- Explain data parallelism, tensor parallelism, and pipeline parallelism. When do you use each? [conceptual]
- What is ZeRO / FSDP and how does it reduce memory? [conceptual]
- What is activation checkpointing and what’s the tradeoff? [tradeoff]
- Explain mixed-precision training (BF16/FP16). Why does BF16 dominate for LLMs? [tradeoff]
- What is the “roofline model” and how does it help reason about GPU utilization? [conceptual]
- Estimate the minimum number of H100 GPUs and wall-clock time needed to pretrain a 7B-parameter model on 1T tokens. State your assumptions. [estimation]
- What is sequence parallelism and how does it complement tensor parallelism for long-context training? [conceptual]
- Explain the difference between Megatron-LM’s 3D parallelism and DeepSpeed ZeRO-3. When do you use each? [tradeoff]
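The FLOPs estimation questions above all start from the standard ~6ND approximation (roughly 2 FLOPs per parameter per token forward, 4 backward, ignoring the attention seq-length term). A worked sketch, with the utilization figure an assumption for illustration:

```python
def train_flops(n_params, n_tokens):
    """Chinchilla-style estimate: total training compute ~ 6 * N * D."""
    return 6 * n_params * n_tokens

flops = train_flops(70e9, 2e12)
print(f"{flops:.2e} FLOPs")  # 8.40e+23

# Rough GPU-time sanity check, assuming 40% utilization of an H100's
# ~990 TFLOPS BF16 (the 40% MFU figure is an assumption, not a spec).
gpu_seconds = flops / (990e12 * 0.40)
print(gpu_seconds / 86400 / 30, "H100-months of compute")
```

Doubling parameters doubles FLOPs per token under this model, which is the quick answer to the second estimation question.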
2.4 Training Recipe & Monitoring
- What’s a typical learning rate schedule for LLM pretraining? [conceptual]
- Loss spiked 3× mid-run then slowly recovered — how do you triage? [debug]
- What are the top 5 failure modes in pretraining (instability, contamination, memorization, distribution shift, safety regression)? [conceptual]
- How do you set up a “probe set” for monitoring quality during pretraining? [design]
- Your training loss is decreasing but downstream eval scores plateaued — what’s wrong? [debug]
- When do you decide to stop pretraining? What signals do you look at? [tradeoff]
- What is gradient norm monitoring? How does a sudden gradient norm spike indicate a training issue, and what should you do about it? [debug]
2.5 Evaluation & Downstream Impact
- What does perplexity measure and what are its limitations? [conceptual]
- Lower perplexity doesn’t always mean better instruction following — why? [conceptual]
- How do pretraining choices constrain mid-training, SFT, and inference? [tradeoff]
- Your base model is weak at code — should you fix it in pretraining or later? [tradeoff]
- You pretrained two models with identical architectures but different data mixtures. Model A has lower perplexity but Model B scores higher on downstream tasks. How do you explain this? [debug]
Chapter 3 — Mid-Training / Continued Pretraining (CPT)
3.1 When & Why
- What is CPT and how does it differ from pretraining and SFT? [conceptual]
- When is CPT the right tool vs SFT vs prompt engineering? [tradeoff]
- A model has high perplexity on legal documents but good general performance — which training stage fixes this? [tradeoff]
- CPT uses the same next-token objective as pretraining but on a different distribution — what does that mean in practice? [conceptual]
- A startup wants to specialize an open-source 7B model for medical Q&A. They have 50K clinical notes. Should they do CPT, SFT, or both? Walk through the decision. [design]
3.2 Data & Mixture
- What is “general replay” and why is it critical for CPT? [conceptual]
- What replay ratio should you start with for CPT (e.g., 80/20)? How do you tune it? [design]
- What is document packing in CPT and why does it matter for throughput? [conceptual]
- Should you use curriculum learning during CPT (start with more replay, then anneal)? [tradeoff]
- How does data formatting differ between CPT and SFT? Why does packing documents without attention masking across document boundaries leak information? [conceptual]
3.3 The Stability Gap
- What is the “stability gap” in CPT and why does it happen? [conceptual]
- General benchmarks dropped sharply after starting CPT — walk through your debugging playbook. [debug]
- What role does learning rate play in CPT stability? [conceptual]
- How do regression gates work in CPT monitoring? [design]
- You ran CPT for 20B tokens and MMLU dropped 5 points. After continued training for 50B more tokens it partially recovered. Explain this “U-shaped” curve and how to set stopping criteria. [debug]
3.4 Tokenizer Extension
- When should you extend the tokenizer during CPT? [tradeoff]
- How do you initialize embeddings for new tokens added to the vocabulary? [conceptual]
- What is the “undertrained token” problem and how do you mitigate it? [debug]
3.5 Training Topology
- Compare the “three topologies”: packing (CPT), masking (SFT), rollouts (RL) — which tokens contribute gradients in each? [conceptual]
- What is the difference between CPT, SFT, DPO, and RL in terms of data construction and loss application? [conceptual]
3.6 CPT → RL Compatibility
- How can mid-training improve RL scaling (e.g., OctoThinker)? [conceptual]
- Given a product requirement, how do you decide between prompt → SFT → DPO → RL → CPT → distill? [design]
Chapter 4 — Post-Training
4.1 SFT (Supervised Fine-Tuning)
- What is SFT and when is it the first lever to pull? [conceptual]
- Describe the four shapes of SFT data: single-turn, multi-turn, tool-use trajectory, safety demonstration. [conceptual]
- What is the “chat template trap”? How can a whitespace change break your model? [debug]
- What is completion-only loss (user-token masking) and why does it improve SFT? [conceptual]
- Write the masked SFT loss formula. What does m_t = 0 vs m_t = 1 mean? [conceptual]
- Your SFT model echoes the user’s prompt back — what went wrong? [debug]
- Too many refusal examples in SFT data caused the model to refuse everything — what’s this called and how do you fix it? [debug]
- How do you ensure train-inference consistency with chat templates? [design]
- Your SFT model follows instructions but is too verbose — what do you do? [debug]
- You SFT’d on 10K examples and got great benchmark scores but users complain the model “sounds robotic.” What’s likely in your data, and how do you fix it? [debug]
- How many SFT examples do you typically need? Discuss the LIMA paper’s finding that “less is more” for alignment vs. the need for diversity. [tradeoff]
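The masked-loss question above has a one-line answer: average the negative log-likelihood only over tokens where m_t = 1. A toy NumPy sketch of completion-only loss:

```python
import numpy as np

def masked_nll(token_logprobs, loss_mask):
    """Completion-only SFT loss: mean NLL over assistant tokens only.
    m_t = 1 keeps a token's loss term; m_t = 0 (prompt/user tokens) drops it."""
    lp = np.asarray(token_logprobs, dtype=float)
    m = np.asarray(loss_mask, dtype=float)
    return -(lp * m).sum() / m.sum()

# Toy sequence: 3 prompt tokens (masked out), 2 completion tokens (kept).
lp   = [-0.1, -0.5, -0.2, -1.0, -2.0]
mask = [0, 0, 0, 1, 1]
print(masked_nll(lp, mask))  # 1.5 = (1.0 + 2.0) / 2
```

Without the mask, easy-to-predict prompt tokens dilute the gradient signal, which is one reason unmasked SFT models tend to echo the user's prompt.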
4.2 PEFT / LoRA
- Explain LoRA — what is the low-rank decomposition and why is it cheaper than full fine-tuning? [conceptual]
- LoRA rank r = 16 on a 4096×4096 weight matrix — how many trainable parameters vs full fine-tuning? [estimation]
- What is QLoRA and how does it reduce memory further? [conceptual]
- Where should you attach LoRA adapters (attention vs MLP) and why? [tradeoff]
- Merged adapter (single-tenant) vs online stacking (multi-tenant) — when do you use each? [tradeoff]
- What is adapter multi-tenancy? How do you serve one base model + 100 LoRA adapters? [design]
- A routing bug sent Tenant A’s request through Tenant B’s adapter — what’s the impact? [debug]
- How do you do regression testing per-adapter in a multi-tenant system? [design]
- Compare LoRA with other PEFT methods: prefix tuning, adapters (Houlsby), IA3. When does LoRA win and when might alternatives be better? [tradeoff]
- You trained a LoRA adapter on top of LLaMA-3-8B. The base model provider released a minor patch — can you reuse your adapter? What are the risks?  [debug]
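The r = 16 estimation question above is pure arithmetic: LoRA replaces a d_in × d_out weight update with two thin factors, A (d_in × r) and B (r × d_out). A worked sketch:

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters for one LoRA adapter: A (d_in x r) + B (r x d_out)."""
    return d_in * r + r * d_out

full = 4096 * 4096                     # 16,777,216 params in the frozen matrix
lora = lora_params(4096, 4096, r=16)   # 131,072 trainable params
print(lora, f"= {lora / full:.2%} of full fine-tuning")  # ~0.78%
```

The ratio 2r/d (here 32/4096) is the quick mental shortcut for square matrices.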
4.3 Preference Alignment (DPO / RLHF / PPO)
- What is preference learning? Why is “which is better” easier to label than “write the perfect answer”? [conceptual]
- Write the KL-regularized reward maximization objective. What does β control? [conceptual]
- Explain the full RLHF pipeline: preference labels → reward model → PPO. [conceptual]
- What is reward hacking? Give a concrete example. [conceptual]
- Why is full RLHF expensive as a system (not just a training job)? [tradeoff]
- Explain DPO in one sentence — how does it avoid the full RL loop? [conceptual]
- Compare DPO vs ORPO — when do you start with which? [tradeoff]
- What is the main limitation of DPO/ORPO (hint: data coverage)? [conceptual]
- “More preferred” does not always mean “more correct” — why does this matter? [conceptual]
- Your DPO-tuned model has higher preference win rate but lower factual accuracy — what happened? [debug]
- Explain iterative DPO / online DPO. Why is generating on-policy data during training better than fixed offline preference pairs? [conceptual]
- What is Constitutional AI (CAI)? How does it reduce reliance on human labelers? [conceptual]
- You run RLHF and the model learns to be very sycophantic (“Great question!”). What went wrong in the reward model and how do you fix it? [debug]
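For the DPO question above, the per-pair loss is short enough to compute by hand: the negative log-sigmoid of β times the policy-vs-reference log-probability margin between chosen and rejected responses. A toy sketch with made-up sequence log-probs:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO per-pair loss: -log sigmoid(beta * margin), where the margin is
    how much more the policy (vs the frozen reference) prefers the chosen
    response over the rejected one. beta plays the KL-coefficient role
    from the RLHF objective."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy already prefers the chosen response more than the reference does:
# margin > 0, so the loss sits below log(2) ~ 0.693.
print(dpo_loss(-10.0, -14.0, -11.0, -13.0))
```

This is why DPO needs no reward model or rollouts at train time: the "reward" is implicit in the two log-prob ratios.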
4.4 Tool Use & RAG Training
- Why isn’t prompting alone sufficient for reliable tool use in production? [conceptual]
- Describe the tool-use interaction chain: select tool → fill arguments → consume output → answer. [conceptual]
- What is the difference between tool underuse (hallucinating) and tool overuse (unnecessary calls)? [debug]
- What is constrained decoding for tool calls? Why is schema validity different from semantic correctness? [tradeoff]
- How is RAG a special case of tool use? [conceptual]
- Your agent hallucinates answers instead of calling the retrieval tool — how do you fix this? [debug]
- How do you construct training data for multi-turn tool use? Discuss synthetic rollout generation, filtering by success, and the cold-start problem. [design]
4.5 Reasoning & Agentic RL
- Compare outcome reward models (ORM) vs process reward models (PRM). [tradeoff]
- What are STaR/ReST-style self-training loops? Why do they help with trajectory data scarcity? [conceptual]
- Explain GRPO — how does it avoid a learned critic and save memory? [conceptual]
- Why does GRPO matter for training 70B+ reasoning models? [tradeoff]
- What is reward hacking in reasoning RL? How does it manifest differently from chat alignment? [debug]
- Your reasoning model got longer and longer outputs over RL training — what’s happening? [debug]
- Compare Dr. GRPO, GSPO, DAPO, LUFFY — what knobs does each tune? [conceptual]
- Explain the “Aha moment” from DeepSeek R1-Zero. What emergent behaviors appeared during RL training without SFT, and why is this significant? [conceptual]
- What is the difference between chain-of-thought prompting, chain-of-thought fine-tuning, and training a reasoning model with RL? When do you use each? [tradeoff]
4.6 Distillation
- Black-box vs white-box distillation — when do you use each? [tradeoff]
- Write the standard KD loss (KL divergence between teacher and student distributions). [conceptual]
- Your distilled student matches the teacher on benchmarks but fails on safety — what went wrong? [debug]
- How do you manage teacher version drift when the teacher is a closed API? [design]
- What is the practical order: black-box first → white-box refinement? [design]
- DeepSeek R1 distilled reasoning from R1 (671B MoE) into dense 1.5B–70B student models. What data was used, what was the training recipe, and how close did students get? [conceptual]
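For the KD-loss question above, the standard form is the KL divergence between temperature-softened teacher and student distributions, scaled by T². A toy NumPy sketch over one position's logits:

```python
import numpy as np

def kd_kl(teacher_logits, student_logits, T=2.0):
    """Forward KL(teacher || student) on temperature-softened softmaxes,
    scaled by T^2 so gradient magnitudes stay comparable as T grows
    (the convention from Hinton et al.'s distillation paper)."""
    def softmax(z):
        z = np.asarray(z, dtype=float) / T
        e = np.exp(z - z.max())
        return e / e.sum()
    p, q = softmax(teacher_logits), softmax(student_logits)
    return T * T * float(np.sum(p * (np.log(p) - np.log(q))))

loss = kd_kl([4.0, 1.0, 0.5], [3.0, 2.0, 0.0])
# Identical logits give zero KL; any mismatch gives a positive loss.
assert kd_kl([4.0, 1.0, 0.5], [4.0, 1.0, 0.5]) < 1e-12 and loss > 0
```

In white-box distillation this term is summed over sequence positions and usually mixed with the ordinary cross-entropy on hard labels.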
4.7 Post-Training System Design
- Design a “Medical Scribe” that knows rare drug names (CPT) but refuses to prescribe (alignment + eval). [design]
- Fixed compute budget: DPO (training) or test-time scaling (inference)? How do you decide? [tradeoff]
- Your SFT model formats tool calls wrong — add more data or switch to constrained decoding? [debug]
- Your agent fails on multi-step tool chaining — what do you change in data, reward, and search? [design]
- You’re the ML lead for a legal-tech startup. You have: an open-source 8B model, 100K contract documents, 5K labeled QA pairs, and a 2-person team. Design the full post-training pipeline from CPT to production. [design]
Chapter 5 — Common Models & Benchmarks
5.1 Architecture Families
- Compare BERT, T5, GPT, LLaMA, Mistral, Qwen, GLM, and DeepSeek — what’s the core architecture choice in each? [conceptual]
- What is the “modern decoder recipe” (RoPE + RMSNorm + SwiGLU + GQA)? Which models use it? [conceptual]
- PaLM uses parallel transformer sublayers — what does that mean and why? [conceptual]
- What is Multi-head Latent Attention (MLA) in DeepSeek V2/V3? [conceptual]
- DeepSeek V3 uses Multi-Token Prediction (MTP) — what is it and why does it help? [conceptual]
- How does DeepSeek R1 train reasoning without SFT (R1-Zero)? [conceptual]
- What is sliding-window attention in Mistral and what does it sacrifice? [tradeoff]
- Mixtral is an MoE: 8 experts, top-2 routing, 47B total / ~13B active — why is this efficient? [conceptual]
- Trace the evolution: GPT-2 → GPT-3 → GPT-3.5 → GPT-4. What changed at each step (scale, RLHF, multimodality, MoE)? [conceptual]
- Compare LLaMA 1 vs LLaMA 2 vs LLaMA 3. What were the key training recipe changes (not just scale) at each generation? [conceptual]
- What is Gemma/Gemini’s approach to long context (up to 1M tokens)? How does it differ from the RoPE-extension approach? [conceptual]
5.2 Evaluation & Benchmarks
- Name a good minimal benchmark suite for a general-purpose LLM (knowledge, math, code, chat, safety). [design]
- What is pass@k in coding benchmarks and how does it differ from pass@1? [conceptual]
- MMLU vs MMLU-Pro — what’s the difference? [conceptual]
- What is the Chatbot Arena (LMSYS) and why is Elo-based ranking useful? [conceptual]
- Your model scores 90% on MMLU but fails on real customer queries — what’s the gap? [debug]
- What are common evaluation pitfalls (prompting format, temperature, contamination, scoring method)? [conceptual]
- LLM-as-a-judge: what biases exist and how do you mitigate them? [tradeoff]
- When would you use human preference tests over automatic benchmarks? [tradeoff]
- How do you evaluate long-context models (needle-in-a-haystack, LongBench)? [design]
- What is TruthfulQA and what failure mode does it target? [conceptual]
- How do you evaluate multimodal models (MMMU, MathVista)? [conceptual]
- How do you detect benchmark contamination (data leakage) in a trained model? [design]
- You need to evaluate a model’s safety: what benchmarks and red-team methods do you use? [design]
- What is the difference between “static” benchmarks (MMLU, HumanEval) and “dynamic” benchmarks (Chatbot Arena, LiveCodeBench)? Why are dynamic benchmarks becoming more important? [conceptual]
- Design an internal evaluation suite for a customer-facing chatbot. What dimensions do you test (accuracy, safety, latency, cost, style)? How do you weight them? [design]
Chapter 6 — Inference & Compression
6.1 Inference Physics
- Explain the two phases of LLM inference: prefill vs decode. Why do they have different bottlenecks? [conceptual]
- Prefill is compute-bound; decode is memory-bandwidth-bound — explain why. [conceptual]
- What is arithmetic intensity and how does it determine whether a workload is compute-bound or memory-bound? [conceptual]
- What is TTFT and what dominates it? What is TPOT/ITL and what dominates it? [conceptual]
- TTFT is too high — walk through your debugging steps. [debug]
- TPOT got worse after a config change — what do you check? [debug]
- An H100 has ~3.35 TB/s memory bandwidth and ~990 TFLOPS BF16. A 70B model in BF16 is ~140GB. What’s the theoretical maximum tokens/second during decode (batch size = 1)? What’s the bottleneck? [estimation]
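The H100 estimation question above is a one-liner once you see that at batch size 1, every decoded token must stream all model weights from HBM once:

```python
def decode_tokens_per_sec(weight_bytes, bandwidth_bytes_per_sec):
    """Batch-1 decode upper bound: tokens/s <= bandwidth / model size.
    Ignores the KV cache and activations, so the real number is lower."""
    return bandwidth_bytes_per_sec / weight_bytes

tps = decode_tokens_per_sec(140e9, 3.35e12)  # 70B params in BF16, one H100
print(f"~{tps:.1f} tokens/s upper bound")    # ~23.9 -> memory-bandwidth-bound
```

The compute side (~990 TFLOPS vs the ~2 FLOPs/param/token needed) is nowhere near saturated at this rate, which is exactly why batching and weight quantization lift decode throughput.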
6.2 KV Cache
- Estimate KV cache size: given layers=80, KV heads=8, head dim=128, FP16 — how many bytes per token? [estimation]
- At 8K context, how much KV memory per sequence from that calculation? [estimation]
- How does GQA reduce KV cache vs MHA? Give the scaling factor. [conceptual]
- What is PagedAttention? How is it like virtual memory? [conceptual]
- Why does naive contiguous KV allocation waste memory? [conceptual]
- What happens when you quantize the KV cache (FP16 → INT8)? What tasks degrade first? [tradeoff]
- Your model runs out of memory on long prompts — walk through your mitigation steps. [debug]
- Compare KV cache compression techniques: quantization, eviction (H2O, StreamingLLM-style sink tokens), sliding window, cross-layer sharing. What works best at extreme context lengths? [tradeoff]
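The two estimation questions at the top of this section can be worked end-to-end; a sketch assuming FP16 (2 bytes per element) for both K and V:

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Factor of 2 for storing both K and V at every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

per_tok = kv_bytes_per_token(80, 8, 128)  # 327,680 bytes = 320 KiB/token
per_seq = per_tok * 8192                  # 8K-token sequence
print(per_tok / 1024, "KiB/token;", per_seq / 2**30, "GiB per 8K sequence")
```

At 320 KiB/token an 8K sequence costs 2.5 GiB, so a modest batch of long sequences can dwarf the weights themselves, motivating GQA, KV quantization, and PagedAttention.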
6.3 System Optimizations
- What is continuous batching (in-flight batching) and how does it differ from static batching? [conceptual]
- Why does admission control matter for continuous batching? [conceptual]
- What is chunked prefill and how does it solve the “convoy effect”? [conceptual]
- What is prefix caching (prompt caching)? When does it give big wins? [conceptual]
- Explain speculative decoding: draft model → target verify. What’s the win condition? [conceptual]
- What is guided/constrained decoding and why does it matter for JSON/tool outputs? [conceptual]
- Compare vLLM, TGI, TensorRT-LLM, and SGLang — what are the key architectural differences and when would you choose each? [tradeoff]
6.4 Kernel Optimizations
- How does FlashAttention reduce HBM traffic? [conceptual]
- What is kernel fusion and why does it help both prefill and decode? [conceptual]
- FlashAttention vs PagedAttention — they solve different problems. Explain. [conceptual]
- What is FlashDecoding and how does it parallelize the decode phase across the KV sequence dimension? [conceptual]
6.5 Serving Architecture
- Why disaggregate prefill and decode onto separate fleets (P/D split)? [design]
- What is Multi-LoRA serving (the “Bento” pattern)? [design]
- How do you batch requests by adapter_id in a Multi-LoRA system? [design]
- What is “tenant bleed” in multi-adapter serving and how do you prevent it? [debug]
- How would you design autoscaling for an LLM inference service? What metrics should trigger scale-up vs. scale-down (queue depth, GPU utilization, p99 latency)? [design]
6.6 Compression
- Compare weight-only quantization, activation quantization, and KV quantization — when do you use each? [tradeoff]
- What is the “outlier problem” in LLM quantization? Why does naive quantization fail? [conceptual]
- What are AWQ and SmoothQuant? What problem does each solve? [conceptual]
- Compare unstructured pruning, structured pruning (N:M sparsity), and architectural sparsity (MoE). [tradeoff]
- What is knowledge distillation for inference? Compare response distillation vs logit distillation. [tradeoff]
- Why can pruning introduce non-linear “quality cliffs”? [conceptual]
- Compare GPTQ, AWQ, and GGML/GGUF quantization. Which is post-training, which is calibration-based? When do you use each? [tradeoff]
- What is quantization-aware training (QAT)? How does it differ from post-training quantization (PTQ) and when is the extra cost worth it? [tradeoff]
6.7 Test-Time Scaling
- What is test-time scaling? How does it improve reliability without retraining? [conceptual]
- Compare best-of-N, self-consistency voting, critique-revise loops, and tree search (MCTS). [tradeoff]
- When would you spend compute on test-time scaling vs retraining? [tradeoff]
- Explain how process reward models (PRMs) enable tree search at inference time. What is the tradeoff between breadth (more candidates) and depth (more reasoning steps)? [conceptual]
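The self-consistency variant compared above is mechanically trivial, which is part of its appeal. A minimal sketch of the voting step, assuming the N chains have already been sampled and their final answers extracted:

```python
from collections import Counter

def self_consistency(answers):
    """Self-consistency voting: sample N chains of thought at nonzero
    temperature, extract each chain's final answer, return the mode."""
    return Counter(answers).most_common(1)[0][0]

# Five sampled chains for a math problem yield these final answers:
print(self_consistency(["42", "42", "41", "42", "7"]))  # 42
```

All the cost is in generating the N samples; best-of-N differs only in replacing the vote with a reward-model argmax over candidates.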
6.8 Metrics & Evaluation
- What are the four core inference metrics (TTFT, TPOT, throughput, $/1M tokens)? [conceptual]
- How do you set up quality regression gates after quantization? [design]
- You quantized your model and needle-in-a-haystack accuracy dropped — why? [debug]
- Your users report “variable latency” — sometimes fast, sometimes 5× slower for similar prompts. Diagnose the issue across batching, caching, and queue depth. [debug]
6.9 System Design Drills
- Why does increasing batch size improve throughput but hurt per-request latency? [conceptual]
- Design an inference system for a latency-sensitive chat product. [design]
- Design an inference system for batch-mode document processing (throughput > latency). [design]
- Your service handles 10K requests/sec with long RAG contexts — what architecture do you propose? [design]
- You need to serve a 70B model but only have a single A100 80GB. Walk through the options (quantization, offloading, smaller model, distillation). [design]
- Compare cloud inference (OpenAI API, Bedrock, Vertex AI) vs. self-hosted (vLLM on GPU, Ollama on Mac) — what factors determine the right choice? [tradeoff]
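For the 70B-on-one-A100 drill, the first step is always the memory arithmetic. A rough sketch (the 10% overhead factor is an illustrative assumption for runtime buffers, and KV cache is excluded here, which matters: the INT8 case technically fits the weights but leaves almost no headroom for cache):

```python
def serving_memory_gb(params_b, bytes_per_param, kv_gb=0.0, overhead_frac=0.10):
    """Rough VRAM estimate: weights + KV cache, plus a runtime overhead factor.
    params_b is the parameter count in billions, so weights come out in GB."""
    weights_gb = params_b * bytes_per_param
    return (weights_gb + kv_gb) * (1 + overhead_frac)

A100_GB = 80
for name, bpp in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    need = serving_memory_gb(70, bpp)
    verdict = "fits" if need < A100_GB else "does not fit"
    print(f"{name}: ~{need:.0f} GB -> {verdict} on one A100 80GB")
```

This is why the practical answers to the drill are INT4 quantization (AWQ/GPTQ), CPU offloading at a large latency cost, or switching to a smaller or distilled model.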
Chapter 7 — Applications / Agents
7.1 Agent Fundamentals
- What is an “agentic application”? How does it differ from a single LLM call? [conceptual]
- Workflow vs. agent — when do you use each? Give a decision rule. [tradeoff]
- Describe the agent control loop: Sense → Think → Act → Check → Update State → Repeat. [conceptual]
- What are common agent products (enterprise search, support triage, deep research, coding agent, ops automation)? [conceptual]
- When should you NOT use an agent? [tradeoff]
- What is the “bitter lesson” for agents — why do simple designs with strong models outperform complex frameworks? [conceptual]
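The Sense → Think → Act → Check → Update State loop above fits in a few lines of code, which is worth internalizing before an interview. A minimal sketch where `policy` is a hypothetical stand-in for the LLM call (returning either a tool request or a final answer) and the iteration cap doubles as a cost cap:

```python
def agent_loop(goal, policy, tools, check, max_iters=8):
    """Minimal agent control loop. `policy(state)` returns either
    ("tool", name, args) or ("final", answer)."""
    state = {"goal": goal, "observations": []}
    for _ in range(max_iters):                       # hard cap = cost control
        action = policy(state)                       # Think
        if action[0] == "final":
            return action[1]
        _, name, args = action
        obs = tools[name](**args)                    # Act
        if not check(obs):                           # Check before trusting
            obs = {"error": "validation failed", "raw": obs}
        state["observations"].append((name, obs))    # Update state
    return None                                      # budget exhausted: escalate

# Toy run: one lookup, then answer from the observation.
tools = {"lookup": lambda key: {"value": 7}}
def policy(state):
    if not state["observations"]:
        return ("tool", "lookup", {"key": "x"})
    return ("final", state["observations"][-1][1]["value"])

print(agent_loop("find x", policy, tools, check=lambda o: "value" in o))
```

Everything a production framework adds (routing, retries, guardrails, observability) hangs off one of these five steps, which is a useful way to structure design answers.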
7.2 Architecture Components
- What is the “policy” in an agent system (model + routing + decoding settings)? [conceptual]
- What are output contracts and why are they critical for production agents? [design]
- Explain the validate → retry → repair prompting pattern. [design]
- What belongs in agent state (structured facts) vs. context (token payload per turn)? [design]
- When do you use an explicit planner vs. implicit planning in the policy prompt? [tradeoff]
- What are the common orchestration patterns: prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer? [conceptual]
- Explain the “ReAct” pattern (Reason + Act). How is it different from pure chain-of-thought? What are its failure modes? [conceptual]
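The validate → retry → repair pattern is concrete enough to sketch. Here the contract is “valid JSON passing a validator,” and on failure the error is fed back as a repair instruction; `generate` is a hypothetical placeholder for one LLM call:

```python
import json

def validated_call(generate, validate, max_retries=2):
    """validate -> retry -> repair: call the model, check the output contract,
    and on failure feed the error back so the model can repair its output."""
    feedback = None
    for _ in range(max_retries + 1):
        raw = generate(feedback)
        try:
            out = json.loads(raw)
            validate(out)                  # raises ValueError on contract breach
            return out
        except (json.JSONDecodeError, ValueError) as e:
            feedback = f"Your last output was invalid ({e}). Return only valid JSON."
    raise RuntimeError("output contract not met after retries")

# Toy model: first attempt has trailing prose, second attempt is clean JSON.
attempts = iter(['{"amount": 10} thanks!', '{"amount": 10}'])
def fake_model(feedback):
    return next(attempts)
def validate(out):
    if "amount" not in out:
        raise ValueError("missing 'amount'")

result = validated_call(fake_model, validate)
print(result)   # {'amount': 10}
```

The key design point is that validation is deterministic code outside the model: the model only sees the failure description, never gets to judge its own output.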
7.3 Tool Interface Design
- What makes a good tool contract (schema, validation, timeouts, retries, idempotency, observability)? [design]
- What is MCP (Model Context Protocol) and how does it relate to tool serving? [conceptual]
- What is idempotency for write tools and why is it critical? [conceptual]
- How do you risk-rate tools (read_low → write_medium → write_high)? [design]
- Your agent made a duplicate refund because the tool wasn’t idempotent — what’s the fix? [debug]
- How do you enforce tool permissions outside the model (allowlists, authz)? [design]
- How do you handle tool errors gracefully? Should the error message go back to the model, and what are the risks of that? [design]
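The standard fix for the duplicate-refund scenario is an idempotency key: the caller attaches a stable key per *intent*, and the tool replays the stored result for repeated keys instead of repeating the side effect. A sketch (class and field names are illustrative; a real implementation would persist `_seen` with a TTL):

```python
import uuid

class RefundTool:
    """A write tool made idempotent via caller-supplied idempotency keys."""
    def __init__(self):
        self._seen = {}            # idempotency_key -> previous result
        self.refunds_issued = 0
    def refund(self, order_id, amount, idempotency_key):
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]   # replay, no new side effect
        self.refunds_issued += 1                 # the real side effect
        result = {"order_id": order_id, "amount": amount, "status": "refunded"}
        self._seen[idempotency_key] = result
        return result

tool = RefundTool()
key = str(uuid.uuid4())                  # generated once per refund intent
r1 = tool.refund("A-17", 25.0, key)
r2 = tool.refund("A-17", 25.0, key)      # retry after a timeout or agent re-loop
print(tool.refunds_issued)               # 1 -- the duplicate was absorbed
```

The subtle interview point: the key must be derived from the intent (order + reason), not generated fresh per call, or retries defeat the mechanism.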
7.4 RAG Pipeline
- Walk through a RAG pipeline: ingest → chunk → embed → index → retrieve → rerank → generate. [design]
- Why does chunk size matter? What happens if chunks are too small or too big? [tradeoff]
- Compare dense retrieval (embeddings), sparse retrieval (BM25), and hybrid — when do you use each? [tradeoff]
- What is a cross-encoder reranker? Why is it better than bi-encoder similarity but more expensive? [tradeoff]
- How do you evaluate retrieval quality (recall@k, MRR, nDCG)? [conceptual]
- Your RAG system retrieves the right documents but the model ignores them — what’s wrong? [debug]
- How do you defend against prompt injection through retrieved documents? [design]
- What are query rewriting and multi-hop retrieval? When is each needed? [conceptual]
- How do you handle stale indexes in a production RAG system? [design]
- Compare vector databases (Pinecone, Weaviate, Qdrant, pgvector) — what are the tradeoffs? [tradeoff]
- You built a RAG system for a 10M-document legal corpus. Retrieval latency is acceptable but answer quality is poor. Walk through a systematic diagnosis: is it chunking, the embedding model, retrieval, reranking, or generation? [debug]
- What is “agentic RAG”? How does it differ from single-shot retrieve-and-generate (e.g., iterative retrieval, query decomposition, self-reflection on retrieval quality)? [conceptual]
- How do you handle tables, images, and structured data in a RAG pipeline? [design]
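Recall@k and MRR come up constantly in retrieval-evaluation questions and are short enough to write from memory. A self-contained sketch with toy document IDs:

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant docs that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mrr(queries):
    """Mean reciprocal rank of the first relevant doc per query (0 if absent)."""
    total = 0.0
    for ranked_ids, relevant_ids in queries:
        rr = 0.0
        for rank, doc in enumerate(ranked_ids, start=1):
            if doc in relevant_ids:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(queries)

queries = [
    (["d3", "d1", "d9"], {"d1"}),   # first relevant doc at rank 2 -> RR 0.5
    (["d7", "d2", "d4"], {"d7"}),   # rank 1 -> RR 1.0
]
print(recall_at_k(["d3", "d1", "d9"], {"d1", "d5"}, k=3))  # 0.5
print(mrr(queries))                                         # 0.75
```

A useful framing for the 10M-document debugging drill: measure recall@k at the retrieval stage and MRR after reranking separately, so you can tell whether the right documents are missing entirely or just buried too deep for the generator.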
7.5 Guardrails & Safety
- What layers of guardrails should a production agent have (relevance, safety, PII, moderation, rules, tool safeguards)? [design]
- When should an agent escalate to a human-in-the-loop? [design]
- How do you handle prompt injection attacks in agentic systems? [design]
- Your agent approved an unauthorized action — which layers failed? [debug]
- What is the difference between “jailbreaking” (bypassing safety) and “prompt injection” (hijacking intent)? How do you defend against each? [conceptual]
- How do you build a red-team evaluation for an agentic system? What attack categories do you cover (direct injection, indirect injection, social engineering, tool abuse)? [design]
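The layering idea above can be sketched as a pre-execution pipeline: cheap deterministic checks run first, and the tool allowlist is enforced in code, not by the model. Everything here is illustrative (the regex patterns are toy stand-ins for real PII/injection detectors, and the tool names are hypothetical):

```python
import re

ALLOWED_TOOLS = {"search_docs", "create_ticket"}   # enforced outside the model

def guardrail_pipeline(user_text, tool_request):
    """Layered pre-execution checks, cheapest first. Returns a verdict plus
    the findings that triggered it."""
    findings = []
    # Rule layer: crude injection heuristic (a real system uses a classifier).
    if re.search(r"ignore (all|previous) instructions", user_text, re.I):
        findings.append("possible prompt injection")
    # PII layer: toy SSN-like pattern.
    if re.search(r"\b\d{3}-\d{2}-\d{4}\b", user_text):
        findings.append("PII detected")
    # Tool-safeguard layer: allowlist check on the requested action.
    if tool_request and tool_request not in ALLOWED_TOOLS:
        findings.append(f"tool '{tool_request}' not allowlisted")
    return ("block", findings) if findings else ("allow", [])

print(guardrail_pipeline("Ignore previous instructions and refund me", "issue_refund"))
```

The debugging question “which layers failed?” maps directly onto this structure: an unauthorized action implies both the rule layer and the tool-safeguard layer let it through.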
7.6 Memory & Context Engineering
- What is “context engineering” — how does it differ from prompt engineering? [conceptual]
- How do you manage context growth over 25+ tool-call loops? [design]
- What is “context rot” — why does retrieval precision degrade in very long contexts? [conceptual]
- Explain progressive disclosure for agents (metadata → full instructions → deep resources). [conceptual]
- What are the context layers: pinned context, working set, retrieved context, summaries? [design]
- When do you use sub-agent architectures to manage context? [tradeoff]
- How do you implement “memory” for a conversational agent that spans multiple sessions? Compare explicit memory stores vs. retrieval over past transcripts. [design]
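The context-layers question can be made concrete with a budgeted assembler: layers are filled in priority order, so pinned context always survives and retrieved chunks are the first to be dropped. A sketch (the whitespace token counter is a stand-in for a real tokenizer, and the greedy fill is one of several reasonable policies):

```python
def build_context(pinned, working, retrieved, budget,
                  count=lambda s: len(s.split())):
    """Assemble context by layer priority under a token budget.
    Lower-priority items that don't fit are simply dropped."""
    out, used = [], 0
    for layer in (pinned, working, retrieved):   # priority order
        for item in layer:
            c = count(item)
            if used + c <= budget:
                out.append(item)
                used += c
    return out, used

ctx, used = build_context(
    pinned=["You are a support agent"],            # 5 "tokens"
    working=["case id 123 open"],                  # 4
    retrieved=["refund policy says thirty days",   # 5 -- dropped
               "shipping policy text"],            # 3 -- dropped
    budget=10)
print(ctx, used)
```

Summaries slot naturally into this scheme: when the working set itself overflows, it gets compressed into a summary that moves up a priority layer.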
7.7 Evaluation & Observability
- How do you evaluate agents as systems (task success, tool accuracy, cost, safety)? [design]
- What should an agent observability system log? [design]
- Build a debug funnel for agent failures: tooling → retrieval → context → generation → guardrails. [design]
- How do you build a replayable evaluation harness with golden tasks? [design]
- What is the cost of an agentic task? How do you estimate and control it (token budgets, max iterations, cost caps)? [estimation]
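For the cost-estimation question, a concrete mechanism is a budget guard checked each loop iteration, with a max-iteration cap as a backstop. A sketch with illustrative prices (not any vendor's):

```python
class CostBudget:
    """Token-budget guard for an agent loop: accumulate spend per step and
    stop before the cap is exceeded."""
    def __init__(self, cap_usd, price_in_per_m=1.0, price_out_per_m=3.0):
        self.cap = cap_usd
        self.pi, self.po = price_in_per_m, price_out_per_m
        self.spent = 0.0
    def charge(self, tokens_in, tokens_out):
        self.spent += tokens_in / 1e6 * self.pi + tokens_out / 1e6 * self.po
        return self.spent
    def exhausted(self):
        return self.spent >= self.cap

budget = CostBudget(cap_usd=0.05)
steps = 0
while not budget.exhausted() and steps < 50:        # iteration cap as backstop
    # One tool-loop iteration: context grows, so input tokens dominate.
    budget.charge(tokens_in=8000, tokens_out=500)
    steps += 1
print(steps, round(budget.spent, 4))
```

The example also shows why agent costs surprise people: at roughly a cent per iteration, re-sending a growing context dominates, so a five-cent cap buys only a handful of loops.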
7.8 Skills & Procedures
- What are “Agent Skills” and how do they differ from tools and prompts? [conceptual]
- Explain progressive disclosure levels for skills (metadata → full SKILL.md → linked resources). [conceptual]
- How do Skills and MCP fit together in an agent system? [design]
7.9 Agent System Design Drills
- Design a “news research agent” that finds, verifies, and summarizes news on a topic with citations. [design]
- Design a customer support triage agent with routing, tool use, and human handoff. [design]
- Design a coding/debugging agent that loops with repo search + tests as verification. [design]
- How do you prevent “search → fetch everything” tool spam in an agent? [design]
- Your agent’s latency is 30 seconds per query — where do you look first? [debug]
- Design an “AI data analyst” agent that takes a natural-language question, writes SQL, executes it, interprets the results, and visualizes them. What tools does it need? What guardrails? [design]
- You’re building a multi-agent system where a “planner” agent delegates to “executor” agents. How do you handle failure recovery when an executor fails mid-task? [design]
Chapter 8 — Taxonomy / Cross-Cutting Concepts
8.1 Key Distinctions
- Capability vs. behavior — where does each come from in the LLM lifecycle? [conceptual]
- Latency vs. throughput — when do you optimize for each? [tradeoff]
- Compute-bound vs. memory-bound — how do you tell which regime you’re in? [conceptual]
- Dense models vs. MoE — what are the pros and cons of each? [tradeoff]
- Online RL (PPO) vs. offline preference learning (DPO) — when do you choose each? [tradeoff]
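The compute-bound vs. memory-bound distinction has a standard quantitative answer: compare arithmetic intensity (FLOPs per byte moved) against the hardware ridge point. A roofline sketch with approximate A100 numbers (peak specs rounded for illustration):

```python
def regime(flops, bytes_moved, peak_flops, peak_bw):
    """Roofline check: above the ridge point you are compute-bound,
    below it you are memory-bound."""
    intensity = flops / bytes_moved
    ridge = peak_flops / peak_bw        # FLOP/byte where compute == bandwidth
    return "compute-bound" if intensity > ridge else "memory-bound"

A100_FLOPS = 312e12   # ~FP16 tensor-core peak, FLOP/s
A100_BW = 2.0e12      # ~HBM bandwidth, bytes/s (ridge ~156 FLOP/byte)

# Batch-1 decode on a 7B FP16 model: each token does ~2 FLOPs/param and
# reads ~2 bytes/param -> intensity ~1 FLOP/byte.
print(regime(2 * 7e9, 2 * 7e9, A100_FLOPS, A100_BW))
# Large-batch prefill reuses each weight across ~512 tokens -> ~512 FLOP/byte.
print(regime(2 * 7e9 * 512, 2 * 7e9, A100_FLOPS, A100_BW))
```

This single calculation explains most of Chapter 6: batching, speculative decoding, and KV-cache compression all exist to move decode up the roofline.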
8.2 Common Confusions
- “Tokenization is just preprocessing” — why is this wrong? [conceptual]
- “Lower perplexity = better model” — when does this break? [conceptual]
- “KV cache is just an O(n²) attention concern” — what else dominates? [conceptual]
- “Byte-level BPE = character-level tokenization” — what’s the difference? [conceptual]
- “Pack more data = better model” — when does packing hurt? [conceptual]
- “RAG solves hallucination” — why is this only partially true? [conceptual]
- “Bigger context window = better performance” — when does longer context actually hurt? [conceptual]
- “Quantized models are always worse” — when can a quantized model outperform a larger full-precision model? [tradeoff]
8.3 End-to-End Design
- Trace a single user query through the full stack: tokenize → embed → transform → decode → serve → return. [design]
- How do training choices (architecture, tokenizer, context length) constrain inference? [tradeoff]
- You’re building an LLM-powered product from scratch for a vertical domain — walk through the full-stack decisions. [design]
- Compare the engineering tradeoffs: latency vs. quality, compute vs. data, dense vs. MoE, tool use vs. reasoning. [tradeoff]
8.4 Cross-Chapter Integration (Advanced)
- How does tokenizer choice in pretraining affect KV cache size at inference? Trace the connection through vocab size → embedding dimension → KV heads → memory per token. [tradeoff]
- You trained a model with 4K context but need 128K for RAG at inference. Trace the full solution path through positional-encoding extension, KV cache management, chunked prefill, and retrieval pipeline design. [design]
- Compare the cost of improving a model’s coding ability via: (a) adding code data in pretraining, (b) code-focused CPT, (c) code SFT, (d) code RL with execution feedback. When is each appropriate? [tradeoff]
- A model produces incorrect tool-call JSON. Diagnose whether the problem is SFT data quality, tokenizer artifacts, the decoding strategy, or the tool schema definition. How do you tell them apart? [debug]
- Your production LLM system costs $0.50 per query; the business requires $0.05. Walk through every lever: model size, quantization, batching, caching, distillation, reducing output tokens, moving to MoE. [estimation]
- You’re evaluating three competing LLMs for deployment. Model A is the largest and highest quality, Model B is medium-sized with fast inference, Model C is the smallest and cheapest. How do you structure the evaluation to make a recommendation for an enterprise customer-support use case? [design]
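The memory-per-token trace asked for in 8.4 reduces to one formula: KV cache per token = 2 (K and V) × layers × KV heads × head dim × bytes per element. A sketch using a Llama-2-7B-like configuration as the worked example (config values from the published model card; treat them as illustrative):

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache footprint per token: K and V, for every layer and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Llama-2-7B-like MHA config: 32 layers, 32 KV heads, head_dim 128, FP16.
mha = kv_bytes_per_token(32, 32, 128)
# The same model with GQA (8 KV heads) shrinks the cache 4x per token.
gqa = kv_bytes_per_token(32, 8, 128)
print(mha / 1024, gqa / 1024)   # KiB per token: 512.0 vs 128.0
```

This is also where the tokenizer enters the trace: a larger vocabulary compresses text into fewer tokens, so the same document occupies proportionally less KV cache even though the per-token cost is unchanged.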
Total: 368 questions
- Conceptual: ~150
- Tradeoff: ~85
- Design: ~70
- Debug: ~45
- Estimation: ~15
- Coding: ~5 (referenced in chapter code files)
References
Foundational Papers
| Paper | Authors | Year | Relevant Chapters |
|---|---|---|---|
| Attention Is All You Need | Vaswani et al. | 2017 | Ch 1 (Attention, Architecture) |
| BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | Devlin et al. | 2019 | Ch 1, Ch 5 |
| Language Models are Unsupervised Multitask Learners (GPT-2) | Radford et al. | 2019 | Ch 1, Ch 2 |
| Language Models are Few-Shot Learners (GPT-3) | Brown et al. | 2020 | Ch 1, Ch 2, Ch 5 |
| Neural Machine Translation of Rare Words with Subword Units (BPE) | Sennrich, Haddow, Birch | 2016 | Ch 1 (Tokenization) |
| Efficient Estimation of Word Representations in Vector Space (word2vec) | Mikolov et al. | 2013 | Ch 1 (Embeddings) |
| GloVe: Global Vectors for Word Representation | Pennington, Socher, Manning | 2014 | Ch 1 (Embeddings) |
Scaling & Pretraining
| Paper | Authors | Year | Relevant Chapters |
|---|---|---|---|
| Scaling Laws for Neural Language Models | Kaplan et al. | 2020 | Ch 2 |
| Training Compute-Optimal Large Language Models (Chinchilla) | Hoffmann et al. | 2022 | Ch 2 |
| The Pile: An 800GB Dataset of Diverse Text for Language Modeling | Gao et al. | 2020 | Ch 2 |
| Deduplicating Training Data Makes Language Models Better | Lee et al. | 2022 | Ch 2 |
| LLaMA: Open and Efficient Foundation Language Models | Touvron et al. | 2023 | Ch 2, Ch 5 |
| Llama 2: Open Foundation and Fine-Tuned Chat Models | Touvron et al. | 2023 | Ch 2, Ch 4, Ch 5 |
Post-Training & Alignment
| Paper | Authors | Year | Relevant Chapters |
|---|---|---|---|
| Training Language Models to Follow Instructions with Human Feedback (InstructGPT) | Ouyang et al. | 2022 | Ch 4 (RLHF) |
| Direct Preference Optimization (DPO) | Rafailov et al. | 2023 | Ch 4 (Preference Alignment) |
| LoRA: Low-Rank Adaptation of Large Language Models | Hu et al. | 2021 | Ch 4 (PEFT) |
| QLoRA: Efficient Finetuning of Quantized LLMs | Dettmers et al. | 2023 | Ch 4 (PEFT) |
| LIMA: Less Is More for Alignment | Zhou et al. | 2023 | Ch 4 (SFT) |
| Constitutional AI: Harmlessness from AI Feedback | Bai et al. | 2022 | Ch 4 (Alignment) |
| DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (introduces GRPO) | Shao et al. | 2024 | Ch 4 (Reasoning RL) |
| STaR: Bootstrapping Reasoning With Reasoning (Self-Taught Reasoner) | Zelikman et al. | 2022 | Ch 4 (Reasoning) |
Architecture & Models
| Paper | Authors | Year | Relevant Chapters |
|---|---|---|---|
| RoFormer: Enhanced Transformer with Rotary Position Embedding (RoPE) | Su et al. | 2021 | Ch 1 (Positional Encoding), Ch 5 |
| Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (ALiBi) | Press et al. | 2022 | Ch 1, Ch 5 |
| GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | Ainslie et al. | 2023 | Ch 1 (Attention), Ch 6 |
| Mamba: Linear-Time Sequence Modeling with Selective State Spaces | Gu & Dao | 2023 | Ch 1 (Architecture) |
| Mixtral of Experts | Mistral AI | 2024 | Ch 5 |
| DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model | DeepSeek-AI | 2024 | Ch 5 |
| DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | DeepSeek-AI | 2025 | Ch 4, Ch 5 |
Inference & Compression
| Paper | Authors | Year | Relevant Chapters |
|---|---|---|---|
| FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | Dao et al. | 2022 | Ch 1, Ch 6 |
| FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | Dao | 2023 | Ch 6 |
| Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM) | Kwon et al. | 2023 | Ch 6 |
| AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | Lin et al. | 2024 | Ch 6 (Compression) |
| SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | Xiao et al. | 2023 | Ch 6 (Compression) |
| GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | Frantar et al. | 2023 | Ch 6 (Compression) |
| Fast Inference from Transformers via Speculative Decoding | Leviathan, Kalman, Matias | 2023 | Ch 6 (System Optimizations) |
Applications & Agents
| Paper / Resource | Authors | Year | Relevant Chapters |
|---|---|---|---|
| Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks | Lewis et al. | 2020 | Ch 7 (RAG) |
| ReAct: Synergizing Reasoning and Acting in Language Models | Yao et al. | 2023 | Ch 7 (Agents) |
| Toolformer: Language Models Can Teach Themselves to Use Tools | Schick et al. | 2023 | Ch 4, Ch 7 |
| Building Effective Agents (blog) | Anthropic | 2024 | Ch 7 |
| LLM Powered Autonomous Agents (blog) | Lilian Weng | 2023 | Ch 7 |
Interview Resources & Question Banks
| Resource | Link | Notes |
|---|---|---|
| MLStack.cafe — LLMs Interview Questions (59 Qs) | mlstack.cafe/interview-questions/llms | Transformer architecture, attention, transfer learning, alignment |
| MLStack.cafe — ChatGPT Interview Questions (42 Qs) | mlstack.cafe/interview-questions/chatgpt | RLHF, tokenization, context handling, evaluation |
| MLStack.cafe — NLP Interview Questions (38 Qs) | mlstack.cafe/interview-questions/nlp | Positional encoding, encoder-decoder, CNNs vs transformers |
| Awesome Generative AI Guide — 60 Common GenAI Interview Qs | github.com/aishwaryanr/awesome-generative-ai-guide | Generative models, LLMs, embeddings, multimodal, training & evaluation |
| HuggingFace NLP Course | huggingface.co/learn/nlp-course | Ch 1–4 foundations |
| Stanford CS324 — Large Language Models | stanford-cs324.github.io/winter2022 | Training, evaluation, societal impact |
| Full Stack Deep Learning — LLM Bootcamp 2023 | fullstackdeeplearning.com/llm-bootcamp | End-to-end LLM application development |
| Chip Huyen — Designing ML Systems (O’Reilly 2022) | huyenchip.com | Training, serving, evaluation, data distribution |
| Jay Alammar — The Illustrated Transformer | jalammar.github.io/illustrated-transformer | Visual guide to transformer architecture |
| Sebastian Raschka — Build a Large Language Model From Scratch (2024) | manning.com | Architecture, pretraining, SFT, alignment |
| UC Berkeley CS294/194-196 — Large Language Model Agents | rdi.berkeley.edu/llm-agents/f24 | Agentic systems, tool use, planning |