6. Applications: Agents and RAG
Overview
This chapter is interview-prep study notes for applying LLMs to agents, RAG, and prompting.
Analogy: think of an agent as a new hire operating a computer—tools are what it can touch (APIs/actions), and skills are the onboarding playbooks it can load on demand (checklists, templates, scripts). You build capable agents by giving them a small set of reliable tools, a library of skills, and a controller loop that decides what to load/use next.
What you’ll learn:
- How to decide workflow vs agent, and when autonomy is worth the variance.
- How to design tools (and optionally MCP) so agents are reliable in production.
- How to think about guardrails, evaluation, and observability as the real sources of reliability.
- A practical RAG pipeline and how to evaluate retrieval (recall@k, MRR, nDCG).
- Prompting patterns that scale: output contracts, validation, and failure-mode thinking.
Agents
What is an agent?
An agent is a system that can plan and act: it iterates over (1) understanding state, (2) selecting an action (often a tool call), (3) observing results, and (4) updating state until a goal is met or it stops.
Useful interview distinction:
- Workflow: LLM + tools orchestrated through predefined code paths (predictable, testable).
- Agent: LLM dynamically directs its own steps and tool usage (flexible, higher variance).
When should you build an agent?
Agents fit workflows that resist deterministic automation:
- Complex decision-making: exceptions, judgment, context-sensitive decisions.
- Brittle rules: rule engines that are expensive to maintain.
- Unstructured data: email threads, docs, tickets, PDFs, free-form user text.
If a deterministic pipeline or a single LLM call + retrieval can meet the target, prefer that.
Common agent products (examples)
- Enterprise search across systems: permissions-aware retrieval + citation-first answers.
- Support triage: routing + tool use + human handoff on edge cases.
- Deep research briefs: planner–executor + multi-hop retrieval + synthesis.
- Coding/debugging agent: ReAct loop with repo search + tests as verification.
- Ops automations (refunds/cancellations): write tools with idempotency + confirmations.
Single-agent design foundations (what to whiteboard)
In its simplest form:
- Model: the LLM used for reasoning and decisions.
- Tools: functions/APIs that read data or take actions.
- Instructions: routines, constraints, and output contracts.
- Controller loop: run until a stop condition.
Typical stop conditions:
- model emits a final structured output,
- model responds with no tool calls,
- a tool errors or times out,
- max turns or max tool calls.
Minimal agent loop (mental model)
```python
def run_agent(state, context):
    while not done(state):
        decision = llm(state, context)
        if decision.tool_call:
            result = call_tool(decision.tool_call, decision.args)
            state = update(state, result)
        else:
            return decision.response
```
Tools and MCP (what interviewers care about)
Treat tools as an agent-computer interface (ACI).
MCP (Model Context Protocol): a standardized way to expose tools/resources to an agent with consistent schemas; interview intuition: “a tool integration layer,” not a safety boundary.
Idempotency: make write actions safe to retry (same request ID → same effect, no duplicate refunds/emails).
Allowlist: an explicit list of permitted tools/actions enforced by your application (never delegated to the model).
Tooling checklist:
- Schemas: explicit parameters and return types; validate inputs/outputs.
- Naming: make tool names and arguments obvious; avoid overlapping tools.
- Timeouts/retries: retries with backoff; classify transient vs permanent errors.
- Idempotency: for write actions, add request IDs and make retries safe.
- Observability: log tool name, args, latency, errors, and results.
If using MCP:
- keep a registry of tools + JSON schemas,
- enforce budgets (max calls, max latency),
- implement allowlists/permissions outside the model.
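As a concrete illustration of enforcing allowlists and budgets outside the model, here is a minimal sketch; the names (ALLOWED_TOOLS, ToolBudget, guarded_tool_call) are hypothetical, not from any specific framework:

```python
# Hypothetical sketch: the application, not the model, owns permissions and budgets.
from dataclasses import dataclass

ALLOWED_TOOLS = {"news.search", "news.fetch"}   # explicit allowlist, owned by the application

@dataclass
class ToolBudget:
    max_calls: int = 25
    calls_used: int = 0

def guarded_tool_call(name: str, args: dict, budget: ToolBudget, registry: dict):
    """The model can only request a call; the runtime decides whether it runs."""
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool '{name}' is not on the allowlist")
    if budget.calls_used >= budget.max_calls:
        raise RuntimeError("tool-call budget exhausted")
    budget.calls_used += 1
    return registry[name](**args)   # invoke with the tool's own timeout/retry policy
```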
Context and Memory Management
Context = what you pass to the model now.
How do you do effective context engineering for AI agents?
- Prompt engineering → context engineering: beyond writing instructions, you’re curating the full token set each turn (instructions, tools, message history, retrieved data, tool outputs).
- Context is a finite “attention budget”: longer contexts can degrade recall/precision (“context rot”); treat tokens as scarce resources with diminishing returns.
- Guiding principle: aim for the smallest high-signal context that reliably produces the desired behavior.
- System prompt altitude (“Goldilocks”): avoid brittle pseudo-code / if-else logic in prompts and avoid vague guidance; be clear, direct, and structured.
- Structure helps: organize instructions into explicit sections (e.g., background, constraints, tool guidance, output spec). This makes it easier to maintain and to debug failures.
- Examples beat rule-lists: prefer a small set of diverse, canonical few-shot examples over “laundry lists” of edge cases.
- Tools are context shapers: design tool contracts for token efficiency and unambiguous choice; bloated/overlapping toolsets increase confusion and waste context.
- Retrieval strategy shift: complement precomputed retrieval (embeddings/RAG) with just-in-time retrieval where the agent keeps lightweight references (paths/URLs/IDs) and pulls details only when needed.
- Progressive disclosure: let the agent explore incrementally; metadata (names, folder structure, timestamps) provides useful signals without loading entire artifacts.
Long-horizon techniques (when the task outgrows the context window)
Compaction: periodically summarize and restart with a high-fidelity condensed state (keep decisions, open issues, next steps; drop redundant chatter).
Tool-result clearing: a low-risk compaction move—once a tool result is “digested,” avoid re-including the raw payload deep in history.
Structured note-taking (agentic memory): persist notes outside the context window (e.g., TODOs, decisions, constraints) and rehydrate them as needed.
Sub-agent architectures: delegate deep exploration to subagents with clean contexts; have them return distilled summaries for the main agent.
A useful way to partition the context you assemble each turn (see the sketch below):
- Pinned context: system policy + tool schemas.
- Working set: recent turns + latest tool results.
- Retrieved context: RAG results pulled on-demand.
- Summaries: compress long traces into structured state.
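A minimal sketch of how these layers plus compaction can come together when building the per-turn prompt; the summarizer callable, constants, and function names are assumptions for illustration:

```python
# Hypothetical sketch: assemble a lean per-turn context from layered state.
MAX_WORKING_TURNS = 8                                   # keep only the latest turns verbatim
PINNED = ["<system policy>", "<tool schemas>"]          # always included

def compact(old_turns: list[str], summarizer) -> str:
    """Compress older turns into a condensed state: decisions, open issues, next steps."""
    return summarizer("Summarize decisions, open issues, and next steps:\n" + "\n".join(old_turns))

def build_context(history: list[str], retrieved: list[str], summary: str) -> list[str]:
    working_set = history[-MAX_WORKING_TURNS:]          # recent turns + latest tool results
    return PINNED + ([summary] if summary else []) + retrieved + working_set

# Tool-result clearing: once a tool result has been digested into `summary` or notes,
# drop the raw payload from `history` so it never re-enters the prompt.
```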
Memory = what persists across sessions.
- Short-term: current conversation state.
- Long-term: durable facts/preferences with provenance and TTL.
Common pitfalls:
- mixing instructions with retrieved data (prompt injection risk),
- dumping raw logs into context (token blow-up),
- storing PII without consent.
Orchestration patterns (single + multi-agent)
Start simple and add structure only when it improves measurable outcomes.
Workflows (predefined orchestration paths):
- Prompt chaining: step-by-step decomposition + gates/checks.
- Routing: classify input and send to specialized prompts/models.
- Parallelization: sectioning (subtasks) or voting (multiple attempts).
- Orchestrator–workers: manager decomposes dynamically, workers execute, manager synthesizes.
- Evaluator–optimizer: generate → critique → refine against explicit criteria.
Agentic systems:
- Single-agent loop: one agent + tools + instructions.
- Manager (agents-as-tools): one agent owns the user interaction and delegates.
- Decentralized handoffs: agents transfer control peer-to-peer (good for triage).
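To make the routing pattern concrete, a minimal sketch follows; the llm callable, model names, and route labels are placeholders:

```python
# Hypothetical routing sketch: classify the request, then dispatch to a specialized prompt/model.
ROUTES = {
    "billing": {"model": "small-fast-model", "prompt": "You handle billing questions..."},
    "bug":     {"model": "stronger-model",   "prompt": "You debug technical issues..."},
    "other":   {"model": "small-fast-model", "prompt": "You answer general questions..."},
}

def route_and_answer(user_msg: str, llm) -> str:
    label = llm(
        f"Classify the message into one of {sorted(ROUTES)}. Reply with the label only.\n\n{user_msg}"
    ).strip()
    route = ROUTES.get(label, ROUTES["other"])          # fall back to a safe default route
    return llm(route["prompt"] + "\n\n" + user_msg, model=route["model"])
```

Routing keeps each specialized prompt small and lets cheaper models handle the easy categories.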
Guardrails (layered defenses)
Guardrails should combine multiple layers:
- Relevance: keep the agent on-scope.
- Safety/jailbreak/injection detection.
- PII filtering and redaction at boundaries.
- Moderation for harmful content.
- Rules-based checks: regex/blocklists, input limits, schema validation.
- Tool safeguards: risk-rate tools (read-only vs write, reversibility, $ impact) and require extra checks or user confirmation for high-risk actions.
Human-in-the-loop triggers:
- exceeding failure thresholds (retries, repeated tool errors),
- high-risk actions (refunds, cancellations, prod changes).
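A minimal sketch of tying risk-rated tools to confirmation and escalation; the risk labels, threshold, and the confirm/escalate hooks are assumptions:

```python
# Hypothetical sketch: gate tool execution on risk level and failure history.
TOOL_RISK = {
    "news.search": "low",     # read-only
    "news.fetch": "low",      # read-only
    "issue_refund": "high",   # write action with $ impact: needs confirmation
}

def run_tool_with_guardrails(name, args, execute, confirm, escalate, recent_errors=0):
    if recent_errors >= 3:                              # repeated tool errors -> human review
        return escalate(reason="repeated_tool_errors", tool=name)
    risk = TOOL_RISK.get(name, "high")                  # unknown tools default to high risk
    if risk == "high" and not confirm(f"Approve '{name}' with {args}?"):
        return escalate(reason="confirmation_denied", tool=name)
    return execute(name, args)
```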
Evaluation and observability
Evaluate agents as systems:
- Task success rate on realistic workflows.
- Tool-call accuracy: correct tool, correct args, correct sequencing.
- Time/cost: wall-clock, tokens, tool calls.
- Safety: disallowed tool attempts, policy violations, PII leakage.
Always log:
- prompt version, tools + args, tool outputs, stop reason, and user-visible outcome.
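A sketch of a per-step log record covering those fields; the schema is an example, not a standard:

```python
# Hypothetical per-step log record; ship it to whatever tracing/metrics backend you use.
import json, time, uuid

def log_step(prompt_version, tool_name, tool_args, tool_output, stop_reason, outcome):
    record = {
        "trace_id": str(uuid.uuid4()),
        "ts": time.time(),
        "prompt_version": prompt_version,
        "tool": {"name": tool_name, "args": tool_args, "output_preview": str(tool_output)[:500]},
        "stop_reason": stop_reason,
        "user_visible_outcome": outcome,
    }
    print(json.dumps(record, default=str))
```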
Case study: toy news agent (fast_agent + fast_mcp)
Goal: “Given a topic, autonomously explore the news landscape, iteratively refine queries, verify claims against sources, and produce a cited brief with an explicit stop reason.”
High-level architecture
- Agent client (CLI/UI) collects user intent and displays results.
- Agent runtime (e.g., fast_agent) runs the loop: decide → call tools → synthesize.
- LLM provider (e.g., OpenRouter) supplies the model used for reasoning + summarization.
- MCP server (e.g., fast_mcp) exposes tools like news.search and news.fetch.
agent client: the user-facing surface (CLI, web UI, Slack bot) that sends tasks to an agent runtime and renders results. It should not embed tool logic; keep that in the runtime/tooling layer.
MCP (Model Context Protocol): a standard protocol for advertising tools/resources to an agent runtime with consistent contracts. Interview intuition: it’s an integration layer that lets one client talk to many tools safely and consistently.
OpenRouter: a hosted API that routes a single OpenAI-compatible request format to many different model providers. Interview intuition: it’s a convenient way to swap models (or use multiple models) without rewriting your client.
Why an LLM “needs” MCP (interview framing)
- The model itself doesn’t speak to your databases or APIs; an agent runtime does.
- MCP makes “tool discovery + invocation” consistent across tools and across clients.
- MCP centralizes governance: tool allowlists, permissions, and budgets can be enforced outside the model.
How MCP acts as an adapter
- Agent client ↔︎ agent runtime: UX and session management.
- Agent runtime ↔︎ LLM: completion/chat API (OpenAI-style, OpenRouter, etc.).
- Agent runtime ↔︎ tools: MCP is the adapter layer that standardizes tool contracts and lets you register many tools behind one interface.
Toy toolset (registered on the MCP server)
- news.search(query, since_days, limit) -> [{title, url, source, published_at}]
- news.fetch(url) -> {url, title, text}
- news.summarize(text, max_bullets) -> {bullets} (optional; many teams do summarization in the LLM instead)
A tool contract is the strict input/output interface an agent runtime relies on (names, types, required fields, error semantics). It’s what makes tool calls testable and safe.
Important features of tool contracts (what interviewers look for)
- Deterministic outputs where possible (tools should not be “creative”).
- Explicit types/shape: required vs optional fields; stable field names.
- Clear failure semantics: structured errors, retryable vs non-retryable.
- Timeouts and bounded work (tool must fail fast rather than hang).
- Idempotency for writes (request IDs) and safe defaults.
- Auth/permissions handled by the runtime, not the model.
- Observability hooks: request IDs, tool name, latency, and minimal logging of sensitive fields.
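For example, the news.search contract could be written down with Pydantic-style models (a sketch; field names and bounds are illustrative, not a specific MCP server's schema):

```python
# Hypothetical typed contract for news.search: explicit shapes, bounded work, structured errors.
from pydantic import BaseModel, Field

class NewsSearchInput(BaseModel):
    query: str = Field(min_length=1)
    since_days: int = Field(default=7, ge=1, le=90)
    limit: int = Field(default=5, ge=1, le=20)      # bounded work per call

class Article(BaseModel):
    title: str
    url: str
    source: str
    published_at: str                               # ISO 8601

class ToolError(BaseModel):
    code: str                                       # e.g., "timeout", "rate_limited"
    retryable: bool                                 # lets the runtime decide whether to retry
    message: str
```

The point to land in an interview: validation happens at the boundary, so malformed inputs/outputs are rejected by the runtime rather than silently passed to the model.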
Agentic loop (what you’d explain on a whiteboard)
- Client sends intent + constraints (topic, time window, preferred sources, max budget).
- Agent proposes a short research plan (subtopics, query variants, coverage targets).
- Agent runs iterative cycles:
  - call news.search with a query variant,
  - call news.fetch for the most promising sources,
  - extract candidate claims + attach citations,
  - assess coverage/novelty/credibility; decide what's missing.
- Agent adapts: rewrite queries, broaden/narrow scope, or request clarification if constraints conflict.
- Stop when one condition is met: coverage threshold reached, marginal gains fall below a threshold, budget exhausted, or sources conflict (trigger “needs human review”).
- Return a structured brief (key findings + citations + confidence + stop reason).
Example skeleton (template, APIs vary by library)
MCP server (fast_mcp-style pseudocode):
```python
# mcp_server.py
# Pseudocode: adjust imports/APIs to your fast_mcp version.

@tool(name="news.search")
def search(query: str, since_days: int = 7, limit: int = 5) -> list[dict]:
    """Return a list of candidate articles with URL + metadata."""
    ...

@tool(name="news.fetch")
def fetch(url: str) -> dict:
    """Fetch article text from a URL."""
    ...

def main():
    server = FastMCPServer(tools=[search, fetch])
    server.run()
```
Agent runtime (fast_agent-style pseudocode):
```python
# agent_client.py
# Pseudocode: emphasize an adaptive loop (plan → act → observe → decide).

mcp = MCPClient("http://localhost:8000")
llm = OpenRouterChat(model="openai/gpt-4o-mini", api_key=os.environ["OPENROUTER_API_KEY"])

goal = {
    "topic": "AI agents",
    "since_days": 7,
    "max_sources": 12,
    "max_tool_calls": 25,
}
state = {
    "queries_tried": [],
    "sources": [],    # {url, title, source, published_at, text}
    "claims": [],     # {claim, citations: [url], confidence}
    "stop_reason": None,
}

while True:
    if len(state["sources"]) >= goal["max_sources"]:
        state["stop_reason"] = "source_budget_reached"
        break
    if tool_calls_used() >= goal["max_tool_calls"]:
        state["stop_reason"] = "tool_budget_reached"
        break

    plan = llm(
        "Propose the next best action (search/fetch), "
        "a query variant if needed, and what coverage is missing.",
        context={"goal": goal, "state": state},
    )

    if plan.action == "search":
        hits = mcp.call("news.search", query=plan.query, since_days=goal["since_days"], limit=10)
        state["queries_tried"].append(plan.query)
        state = update_state_with_hits(state, hits)
    elif plan.action == "fetch":
        doc = mcp.call("news.fetch", url=plan.url)
        state = update_state_with_doc(state, doc)
    elif plan.action == "stop":
        state["stop_reason"] = plan.reason  # e.g., "coverage_sufficient" / "sources_conflict"
        break

# Extract/verify claims against collected sources (LLM-as-judge + strict citation rules).
state["claims"] = llm(
    "Extract candidate claims and attach citations; "
    "if a claim lacks support, mark it as uncertain.",
    context={"sources": state["sources"]},
)

brief = llm(
    "Write a structured briefing with citations, confidence, and stop_reason.",
    context={"goal": goal, "claims": state["claims"], "stop_reason": state["stop_reason"]},
)
print(brief)
```
A bunch of other frameworks (including MCP-based stacks) support tool use in essentially the same way—this landscape is changing fast:
- LangChain / LangGraph: large ecosystem, graph-based orchestration.
- LlamaIndex: strong RAG + data connectors.
- Semantic Kernel: enterprise-friendly agent + tool abstractions.
- AutoGen / CrewAI: multi-agent coordination patterns.
- Haystack: retrieval pipelines + production patterns.
- DSPy / PydanticAI: programmatic prompting / typed structured outputs.
- MCP ecosystem: tool servers + client runtimes that standardize tool wiring.
Interview drills (fast Q&A)
System design template (talk track)
One framework to answer most “Design an agent that does X” questions:
- 1) Clarify X + constraints: what’s the input, output, latency/SLA, and what’s out of scope?
- 2) Define success + risk metrics: task success rate, cost/latency, escalation rate, and domain-specific error rates (e.g., false approvals).
- 3) Choose workflow vs agent: start with the simplest deterministic workflow that can hit the target; justify autonomy only for long-tail decisions.
- 4) Specify tools + contracts: list read vs write tools, schemas, retries/timeouts, and idempotency for writes; enforce permissions/allowlists outside the model.
- 5) Design state + context: what’s pinned (policy/tools), what’s retrieved (RAG), what’s in working memory, and what persists (notes/memory with provenance + TTL).
- 6) Orchestration: single-agent loop first; add routing / planner–executor / evaluators / subagents only when it improves measured outcomes.
- 7) Guardrails + human handoff: risk-rate tools/actions, require confirmation for high-blast-radius writes, and define clear escalation triggers.
- 8) Evals + rollout: build replayable test sets, measure tool-call correctness + safety, ship in shadow/assist mode, then tighten budgets and monitoring.
“For the toy news agent, what should the tool contracts look like?”
- news.search: strict schema, bounded results, stable IDs/URLs, and clear error semantics (retryable vs not).
- news.fetch: deterministic extraction, timeouts, size limits, and normalized fields (title/source/published_at/text).
- Prefer returning summaries + pointers over huge payloads; keep token cost predictable.
“How do you prevent ‘search → fetch everything’ tool spam?”
- Budgets: max tool calls, max sources, max tokens per tool result; stop when marginal gain falls.
- Planner step must emit an action + rationale + stop criteria; log stop_reason.
- Add lightweight heuristics: dedupe by URL/domain, prefer primary sources, diversify outlets.
“How do you evaluate whether the agent is actually doing good retrieval?”
- Retrieval metrics: recall@k / MRR / nDCG on a labeled set of “topic → relevant articles”.
- Behavioral metrics: source diversity, freshness, redundancy rate, and citation coverage (claims with ≥2 independent sources).
- End-to-end: briefing faithfulness (citations support claims), contradiction rate, and abstention quality.
“How do you handle context growth over a 25-tool-call loop?”
- Treat context as an attention budget: keep raw tool outputs out of the prompt once digested.
- Maintain structured state (queries tried, sources metadata, extracted claims) and compact periodically.
- Retrieve just-in-time via references (URLs/IDs) rather than carrying full documents.
“Where do agents like this fail in practice, technically?”
- Ambiguous tools or overlapping capabilities → wrong tool choice.
- Context pollution: too many low-signal snippets → missed key facts (“context rot”).
- Citation drift: model states claims without anchoring to source spans.
- Latency/cost blowups: unconstrained fetching, reranking, or repeated query rewrites.
“How do you make this safe and robust?”
- Treat retrieved text as data: isolate it, forbid it from overriding instructions, and sanitize HTML.
- Enforce allowlists + permissions outside the model; risk-rate tools even if they’re read-only.
- Add validators: schema checks, citation-required checks, and retries with explicit error messages.
- Human-in-the-loop triggers: conflicting sources, low confidence, repeated tool errors, or budget exhaustion.
Skills
Claude Skills (Agent Skills) are reusable, composable packages of instructions + scripts + resources (files/folders) that equip an agent with domain-specific procedural knowledge—like writing an onboarding guide for a new hire.
Mental model
- Skills = “procedural expertise” you install once instead of re-pasting long prompts.
- Progressive disclosure keeps context lean: only lightweight metadata is always present; details load on-demand.
- Skills can include deterministic scripts for steps where token-by-token generation is slow or unreliable.
How Skills work (high level)
- Startup: the agent pre-loads each skill's name + description (from YAML frontmatter) into the system prompt.
- Relevance check: on a user task, the model decides whether to trigger a skill (similar to tool selection).
- On-demand load: when triggered, the agent reads the full SKILL.md into context.
- Deeper load: SKILL.md can reference additional files (e.g., forms.md, reference.md); the model reads them only if needed.
- Code execution (optional): skills may include scripts that the agent can run for deterministic steps, without needing to load the entire script (or large inputs) into the context window.
Skill folder structure
- A skill is just a directory (e.g., changelog-newsletter/).
- Required: SKILL.md at the root.
  - Starts with YAML frontmatter:
    name: Changelog to Newsletter
    description: Convert git changelogs into a reader-friendly newsletter
  - Followed by Markdown instructions, checklists, examples, and links to other files.
- Optional:
  - templates.md, examples.md, reference.md
  - scripts/ with *.py helpers for parsing/validation/formatting
  - resources/ with sample inputs/outputs
Progressive disclosure levels (useful interview phrasing)
- Level 1: metadata only (name + description) — always present.
- Level 2: full SKILL.md — loaded when the skill is triggered.
- Level 3+: linked files/resources/scripts — loaded/executed only when relevant.
Skills vs prompting vs tools (interview answers)
- Are Skills “prompt tech”?
- Skills use prompting techniques (instructions, few-shot examples, templates), but they’re not just prompts: they add packaging, persistence, and on-demand loading.
- Skills vs prompting:
- Prompts are typically one-off or session-scoped; skills are versionable artifacts you can share across users/sessions and keep token-efficient via progressive disclosure.
- Skills vs tool calls:
- Tools provide the capability/action (“do X / fetch Y”); skills provide the playbook (“how to do X reliably”) and can teach consistent orchestration (plus optional scripts).
Skills + MCP (how they fit together)
- MCP provides standardized access to tools/data sources (“connectivity layer”).
- Skills provide the organization’s workflow, heuristics, and output formats (“operating procedure”).
- Combined pattern: MCP exposes news.search / news.fetch, and a “News Briefing” skill defines query strategies, citation rules, budgets/stop reasons, and validation steps.
Best practices
- Keep skills narrow and composable; split big domains into multiple skills.
- Write descriptions so triggering is reliable: when to use + what output to produce.
- Prefer canonical examples over exhaustive edge-case lists.
- Start with evaluation: observe where the agent fails on representative tasks, then build skills incrementally to patch those gaps.
- Audit any scripts/resources (treat as production code): least privilege, safe defaults, and clear inputs/outputs.
Security considerations
- Install skills only from trusted sources; audit instructions, scripts, and bundled resources.
- Watch for instructions or dependencies that could enable unsafe actions or data exfiltration.
RAG (Retrieval-Augmented Generation)
RAG retrieves relevant context from a corpus and then generates an answer conditioned on that retrieved context.
Baseline pipeline
- Ingest: parse docs, extract text, attach metadata.
- Chunk: split into retrieval units (preserve structure and headings).
- Embed: generate vectors for chunks (dense, sparse, or both).
- Index: store vectors + metadata.
- Retrieve: top-\(k\) candidates for a query.
- Rerank (optional): improve ordering using a stronger model.
- Generate: answer with citations and constraints.
Chunking (the lever people underestimate)
- Too small: loses context; retrieval gets noisy.
- Too big: irrelevant text dilutes signal and increases tokens.
- Align splits to structure (headings, paragraphs, code blocks).
- Preserve metadata: title, section, URL/path, timestamp, permissions.
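A minimal sketch of structure-aware chunking for Markdown-like docs: split on headings, cap chunk size, and keep metadata attached (the size cap and regex are arbitrary choices):

```python
# Hypothetical structure-aware chunker: heading-aligned splits with metadata preserved.
import re

def chunk_markdown(doc_text: str, source: str, max_chars: int = 1500) -> list[dict]:
    chunks = []
    parts = re.split(r"(?m)^(#{1,3} .*)$", doc_text)   # keep heading lines as separators
    heading = ""
    for part in parts:
        if re.match(r"^#{1,3} ", part):
            heading = part.strip()
            continue
        text = part.strip()
        while text:
            piece, text = text[:max_chars], text[max_chars:]
            chunks.append({"text": piece, "heading": heading, "source": source})
    return chunks
```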
Retrieval
- Dense (embeddings): semantic similarity.
- Sparse (BM25): keyword matching.
- Hybrid: combine dense + sparse for robustness.
- Use metadata filters (product, date, language, access scope).
BM25 is a classic keyword-based ranking function used in sparse retrieval; it’s strong when exact terms matter (IDs, error codes, product names) and complements embeddings.
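One common way to combine dense and sparse results is reciprocal rank fusion (RRF); a minimal sketch, assuming each retriever returns a ranked list of doc IDs (best first):

```python
# Reciprocal rank fusion: merge rankings without calibrating scores across retrievers.
def rrf_fuse(dense_ranked: list[str], sparse_ranked: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (dense_ranked, sparse_ranked):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# e.g., fused = rrf_fuse(embedding_top_docs, bm25_top_docs)[:10]
```

RRF is popular because it needs no score normalization between retrievers; metadata filters can be applied before or after fusion.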
Reranking
- Cross-encoder rerankers are strong but add latency.
- Heuristic reranking helps (freshness, authority, doc type).
A cross-encoder scores a (query, document) pair together (full attention across both), which usually improves ranking quality vs embedding similarity, but increases latency/cost.
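A sketch of plugging in a cross-encoder reranker, here with sentence-transformers (the model name is only an example; any (query, document) scorer fits the same shape):

```python
# Cross-encoder reranking sketch: score each (query, doc) pair jointly, keep the best few.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example model

def rerank(query: str, candidates: list[dict], top_n: int = 5) -> list[dict]:
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)                 # one joint relevance score per pair
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]
```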
Generation and citations
- Provide retrieved passages as quoted context; require citations.
- Enforce constraints: “If not in sources, say you don’t know.”
- Consider a two-step: draft → verify against citations.
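A minimal sketch of that two-step draft → verify pattern with citation constraints; llm is a placeholder completion call and the prompts are illustrative:

```python
# Hypothetical draft-then-verify generation with numbered citations.
def answer_with_citations(question: str, passages: list[dict], llm) -> dict:
    sources = "\n\n".join(f"[{i}] {p['text']}" for i, p in enumerate(passages))
    draft = llm(
        "Answer using ONLY the numbered sources below and cite them like [0], [1]. "
        "If the answer is not in the sources, say you don't know.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )
    verdict = llm(
        "For each claim in the answer, check that its cited source supports it. "
        "Reply 'OK' or list the unsupported claims.\n\n"
        f"Sources:\n{sources}\n\nAnswer:\n{draft}"
    )
    return {"answer": draft, "verification": verdict}
```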
Common failure modes
- good retrieval, bad generation (model ignores context),
- bad retrieval (embedding mismatch, chunking issues, no hybrid),
- prompt injection from retrieved docs (treat as data),
- stale indexes (no incremental ingestion or TTL).
Evals and operations
- Retrieval: recall@k, MRR, nDCG.
- Answers: faithfulness, completeness, helpfulness (human eval for ambiguity).
- Observability: log doc IDs, scores, chunks shown, rerank results.
- Cost controls: context budgets, caching embeddings/retrieval, summarization.
recall@k: whether at least one relevant chunk/doc appears in the top k retrieved results.
MRR (Mean Reciprocal Rank): emphasizes ranking the first relevant result near the top (higher when the first relevant item appears earlier).
nDCG (normalized Discounted Cumulative Gain): measures ranking quality when relevance can be graded (very relevant vs somewhat relevant), rewarding good ordering.
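A minimal sketch of computing these per query with binary relevance labels (matching the recall@k definition above); report the mean over a labeled query set:

```python
# Per-query retrieval metrics; `retrieved` is a ranked list of doc IDs, `relevant` a labeled set.
import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return float(any(doc in relevant for doc in retrieved[:k]))      # "at least one hit" variant

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0                                                       # MRR = mean over queries

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    dcg = sum(1.0 / math.log2(i + 2) for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0
```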
Enhancements (practical)
- query rewriting,
- multi-hop retrieval,
- self-ask (subquestions per hop),
- structured retrieval (tables/JSON extraction before generation).
References
- OpenAI: A practical guide to building agents
- Google Cloud / Workspace: AI agents for business (Gemini Enterprise) — TODO: add direct link
- Anthropic: Building effective agents
- Anthropic: Effective context engineering for AI agents
- Anthropic: Equipping agents for the real world with Agent Skills
- Agentic design patterns (router / planner-executor / graphs) — TODO: add canonical link