6. Applications: Agents and RAG

Overview

This chapter is interview-prep study notes for applying LLMs to agents, RAG, and prompting.

Analogy: think of an agent as a new hire operating a computer—tools are what it can touch (APIs/actions), and skills are the onboarding playbooks it can load on demand (checklists, templates, scripts). You build capable agents by giving them a small set of reliable tools, a library of skills, and a controller loop that decides what to load/use next.

What you’ll learn:

  • How to decide workflow vs agent, and when autonomy is worth the variance.
  • How to design tools (and optionally MCP) so agents are reliable in production.
  • How to think about guardrails, evaluation, and observability as the real sources of reliability.
  • A practical RAG pipeline and how to evaluate retrieval (recall@k, MRR, nDCG).
  • Prompting patterns that scale: output contracts, validation, and failure-mode thinking.

Agents

What is an agent?

An agent is a system that can plan and act: it iterates over (1) understanding state, (2) selecting an action (often a tool call), (3) observing results, and (4) updating state until a goal is met or it stops.

Useful interview distinction:

  • Workflow: LLM + tools orchestrated through predefined code paths (predictable, testable).
  • Agent: LLM dynamically directs its own steps and tool usage (flexible, higher variance).

When should you build an agent?

Agents fit workflows that resist deterministic automation:

  • Complex decision-making: exceptions, judgment, context-sensitive decisions.
  • Brittle rules: rule engines that are expensive to maintain.
  • Unstructured data: email threads, docs, tickets, PDFs, free-form user text.

If a deterministic pipeline or a single LLM call + retrieval can meet the target, prefer that.

Common agent products (examples)

  • Enterprise search across systems: permissions-aware retrieval + citation-first answers.
  • Support triage: routing + tool use + human handoff on edge cases.
  • Deep research briefs: planner–executor + multi-hop retrieval + synthesis.
  • Coding/debugging agent: ReAct loop with repo search + tests as verification.
  • Ops automations (refunds/cancellations): write tools with idempotency + confirmations.

Single-agent design foundations (what to whiteboard)

In its simplest form:

  1. Model: the LLM used for reasoning and decisions.
  2. Tools: functions/APIs that read data or take actions.
  3. Instructions: routines, constraints, and output contracts.
  4. Controller loop: run until a stop condition.

Typical stop conditions:

  • model emits a final structured output,
  • model responds with no tool calls,
  • a tool errors or times out,
  • max turns or max tool calls.

Minimal agent loop (mental model)

turns = 0
while not done(state) and turns < max_turns:      # max turns/tool calls is a stop condition
    decision = llm(state, context)                # model picks: tool call or final answer
    if decision.tool_call:
        result = call_tool(decision.tool_call)    # execute the chosen tool with its args
        state = update(state, result)             # fold the observation back into state
        turns += 1
    else:
        return decision.response                  # no tool call => final output
return finalize(state)                            # budget exhausted: return best-effort state

Tools and MCP (what interviewers care about)

Treat tools as an agent-computer interface (ACI).

Note: Definitions (tool reliability)

MCP (Model Context Protocol): a standardized way to expose tools/resources to an agent with consistent schemas; interview intuition: “a tool integration layer,” not a safety boundary.

Idempotency: make write actions safe to retry (same request ID → same effect, no duplicate refunds/emails).

Allowlist: an explicit list of permitted tools/actions enforced by your application (never delegated to the model).

Tooling checklist (a minimal wrapper sketch follows the list):

  • Schemas: explicit parameters and return types; validate inputs/outputs.
  • Naming: make tool names and arguments obvious; avoid overlapping tools.
  • Timeouts/retries: retries with backoff; classify transient vs permanent errors.
  • Idempotency: for write actions, add request IDs and make retries safe.
  • Observability: log tool name, args, latency, errors, and results.
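
To make the checklist concrete, here is a minimal sketch of a write-tool wrapper with timeouts, retries with backoff, an idempotency key, and logging. The endpoint, tool name, and field names are hypothetical; adjust to your stack.

# tool_call_wrapper.py — sketch only; endpoint and field names are hypothetical.
import json
import logging
import time
import urllib.error
import urllib.request
import uuid

log = logging.getLogger("tools")

def call_refund_tool(order_id: str, amount_cents: int,
                     request_id: str | None = None,
                     max_retries: int = 3, timeout_s: float = 5.0) -> dict:
    """Write tool with an idempotency key: retrying the same request_id must not
    issue a second refund (enforced server-side)."""
    request_id = request_id or str(uuid.uuid4())
    payload = json.dumps({"order_id": order_id, "amount_cents": amount_cents,
                          "request_id": request_id}).encode()
    req = urllib.request.Request("https://example.internal/refunds",   # hypothetical endpoint
                                 data=payload,
                                 headers={"Content-Type": "application/json"})
    for attempt in range(1, max_retries + 1):
        start = time.monotonic()
        try:
            with urllib.request.urlopen(req, timeout=timeout_s) as resp:
                result = json.loads(resp.read())
            log.info("tool=refund request_id=%s latency=%.2fs ok",
                     request_id, time.monotonic() - start)
            return result
        except urllib.error.HTTPError as e:
            if e.code < 500:                        # permanent error: don't retry
                log.error("tool=refund permanent error %s", e.code)
                raise
            log.warning("tool=refund transient error %s (attempt %d)", e.code, attempt)
        except (urllib.error.URLError, TimeoutError) as e:
            log.warning("tool=refund timeout/network error %s (attempt %d)", e, attempt)
        time.sleep(2 ** attempt)                    # exponential backoff between retries
    raise RuntimeError(f"refund failed after {max_retries} attempts (request_id={request_id})")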

If using MCP (an enforcement sketch follows the list):

  • keep a registry of tools + JSON schemas,
  • enforce budgets (max calls, max latency),
  • implement allowlists/permissions outside the model.
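
A sketch of enforcing allowlists and call budgets in the application, not the model; `mcp_client.call` stands in for whatever MCP client API you use.

# tool_gate.py — allowlist + budget enforcement in the runtime (sketch).
from dataclasses import dataclass, field

@dataclass
class ToolBudget:
    max_calls: int = 25
    calls_used: int = 0

@dataclass
class ToolGate:
    allowlist: set[str]                        # permitted tool names, decided by the app
    budget: ToolBudget = field(default_factory=ToolBudget)

    def call(self, mcp_client, tool_name: str, **args):
        if tool_name not in self.allowlist:
            raise PermissionError(f"tool '{tool_name}' is not on the allowlist")
        if self.budget.calls_used >= self.budget.max_calls:
            raise RuntimeError("tool-call budget exhausted")
        self.budget.calls_used += 1
        return mcp_client.call(tool_name, **args)   # hypothetical MCP client method

# Usage: gate = ToolGate(allowlist={"news.search", "news.fetch"})
#        hits = gate.call(mcp, "news.search", query="AI agents", limit=10)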

Context and Memory Management

Context = what you pass to the model now.

How do you do effective context engineering for agents?

  • Prompt engineering → context engineering: beyond writing instructions, you’re curating the full token set each turn (instructions, tools, message history, retrieved data, tool outputs).
  • Context is a finite “attention budget”: longer contexts can degrade recall/precision (“context rot”); treat tokens as scarce resources with diminishing returns.
  • Guiding principle: aim for the smallest high-signal context that reliably produces the desired behavior.
  • System prompt altitude (“Goldilocks”): avoid brittle pseudo-code / if-else logic in prompts and avoid vague guidance; be clear, direct, and structured.
  • Structure helps: organize instructions into explicit sections (e.g., background, constraints, tool guidance, output spec). This makes it easier to maintain and to debug failures.
  • Examples beat rule-lists: prefer a small set of diverse, canonical few-shot examples over “laundry lists” of edge cases.
  • Tools are context shapers: design tool contracts for token efficiency and unambiguous choice; bloated/overlapping toolsets increase confusion and waste context.
  • Retrieval strategy shift: complement precomputed retrieval (embeddings/RAG) with just-in-time retrieval where the agent keeps lightweight references (paths/URLs/IDs) and pulls details only when needed (see the sketch after this list).
  • Progressive disclosure: let the agent explore incrementally; metadata (names, folder structure, timestamps) provides useful signals without loading entire artifacts.
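
A minimal sketch of just-in-time retrieval and progressive disclosure: the agent always carries cheap references (path plus a one-line preview) and loads a file's full text only when the model decides it matters. The directory layout is a placeholder.

# jit_context.py — keep references in context, load bodies on demand (sketch).
from pathlib import Path

def list_references(root: str) -> list[dict]:
    """Cheap metadata the agent can always keep in context."""
    refs = []
    for p in Path(root).rglob("*.md"):
        first_line = p.read_text(encoding="utf-8").splitlines()[:1]
        refs.append({"path": str(p), "preview": first_line[0] if first_line else ""})
    return refs

def load_reference(path: str, max_chars: int = 8_000) -> str:
    """Loaded only when the model asks for this specific document."""
    return Path(path).read_text(encoding="utf-8")[:max_chars]

# The prompt gets list_references("docs/") every turn; load_reference() is exposed
# as a tool the model calls when a specific document is actually needed.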

Long-horizon techniques (when the task outgrows the context window)

  • Compaction: periodically summarize and restart with a high-fidelity condensed state (keep decisions, open issues, next steps; drop redundant chatter) — see the sketch after this list.

  • Tool-result clearing: a low-risk compaction move—once a tool result is “digested,” avoid re-including the raw payload deep in history.

  • Structured note-taking (agentic memory): persist notes outside the context window (e.g., TODOs, decisions, constraints) and rehydrate them as needed.

  • Sub-agent architectures: delegate deep exploration to subagents with clean contexts; have them return distilled summaries for the main agent.

A practical way to partition what the window holds at any point:

  • Pinned context: system policy + tool schemas.

  • Working set: recent turns + latest tool results.

  • Retrieved context: RAG results pulled on-demand.

  • Summaries: compress long traces into structured state.
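
A sketch of compaction plus structured note-taking: summarize the transcript into durable notes, keep decisions/open issues/next steps, and drop raw tool payloads. It assumes `llm(prompt, context=...)` returns the JSON string requested, as in the runtime sketch later in this chapter.

# compaction.py — compress a long trace into structured state (sketch).
import json

COMPACTION_PROMPT = (
    "Summarize this agent transcript as JSON with keys: "
    "'decisions', 'open_issues', 'next_steps', 'key_facts'. "
    "Keep citations/IDs; drop raw tool payloads and chit-chat."
)

def compact(llm, transcript: list[dict], notes: dict) -> tuple[list[dict], dict]:
    """Return a fresh, short transcript plus persistent notes."""
    summary = llm(COMPACTION_PROMPT, context={"transcript": transcript})  # assumed to return JSON text
    notes = {**notes, **json.loads(summary)}            # merge into durable notes
    new_transcript = [{"role": "system",
                       "content": "Compacted state: " + json.dumps(notes)}]
    return new_transcript, notes

# Call compact() when the transcript approaches the context budget;
# raw tool results live in logs, not in the prompt.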

Memory = what persists across sessions.

  • Short-term: current conversation state.
  • Long-term: durable facts/preferences with provenance and TTL (sketched below).
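
A minimal sketch of a long-term memory record with provenance and TTL; field names are illustrative.

# memory.py — durable facts with provenance and expiry (sketch).
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class MemoryItem:
    fact: str                 # e.g., "user prefers weekly digests"
    source: str               # provenance: conversation ID, doc URL, or tool call
    created_at: datetime      # store timezone-aware timestamps
    ttl_days: int = 90        # expire stale facts instead of trusting them forever

    def expired(self, now: datetime | None = None) -> bool:
        now = now or datetime.now(timezone.utc)
        return now - self.created_at > timedelta(days=self.ttl_days)

def rehydrate(items: list[MemoryItem]) -> list[str]:
    """Only non-expired facts get injected back into context."""
    return [m.fact for m in items if not m.expired()]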

Common pitfalls:

  • mixing instructions with retrieved data (prompt injection risk),
  • dumping raw logs into context (token blow-up),
  • storing PII without consent.

Orchestration patterns (single + multi-agent)

Start simple and add structure only when it improves measurable outcomes.

Workflows (predefined orchestration paths):

  • Prompt chaining: step-by-step decomposition + gates/checks.
  • Routing: classify input and send to specialized prompts/models (sketched after this list).
  • Parallelization: sectioning (subtasks) or voting (multiple attempts).
  • Orchestrator–workers: manager decomposes dynamically, workers execute, manager synthesizes.
  • Evaluator–optimizer: generate → critique → refine against explicit criteria.
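
To ground one of these, a sketch of the routing pattern: a cheap classification step picks a specialized prompt/model, keeping each path simple and testable. The route names, model names, and the `llm(prompt, context=..., model=...)` signature are placeholders.

# routing.py — classify, then dispatch to a specialized path (sketch).
ROUTES = {
    "billing":   {"model": "small-model", "prompt": "You are a billing assistant..."},
    "technical": {"model": "large-model", "prompt": "You are a debugging assistant..."},
    "other":     {"model": "small-model", "prompt": "You are a general assistant..."},
}

def route(llm, user_message: str) -> str:
    label = llm(
        "Classify the message into one of: billing, technical, other. "
        "Answer with the label only.",
        context={"message": user_message},
    ).strip().lower()
    return label if label in ROUTES else "other"       # fall back on unknown labels

def handle(llm, user_message: str) -> str:
    cfg = ROUTES[route(llm, user_message)]
    return llm(cfg["prompt"], context={"message": user_message}, model=cfg["model"])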

Agentic systems:

  • Single-agent loop: one agent + tools + instructions.
  • Manager (agents-as-tools): one agent owns the user interaction and delegates.
  • Decentralized handoffs: agents transfer control peer-to-peer (good for triage).

Guardrails (layered defenses)

Guardrails should combine multiple layers (a layered-check sketch follows the list):

  • Relevance: keep the agent on-scope.
  • Safety/jailbreak/injection detection.
  • PII filtering and redaction at boundaries.
  • Moderation for harmful content.
  • Rules-based checks: regex/blocklists, input limits, schema validation.
  • Tool safeguards: risk-rate tools (read-only vs write, reversibility, $ impact) and require extra checks or user confirmation for high-risk actions.
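
A sketch of layering cheap rules-based checks with tool risk rating; the patterns, thresholds, and high-risk tool names are illustrative.

# guardrails.py — layered input checks + risk-rated tool gating (sketch).
import re

BLOCKLIST = [r"(?i)ignore (all|previous) instructions",   # crude injection signal
             r"\b\d{3}-\d{2}-\d{4}\b"]                     # SSN-like pattern (PII)
MAX_INPUT_CHARS = 8_000

HIGH_RISK_TOOLS = {"refunds.create", "subscriptions.cancel"}   # hypothetical write tools

def check_input(text: str) -> list[str]:
    violations = []
    if len(text) > MAX_INPUT_CHARS:
        violations.append("input_too_long")
    violations += [f"pattern:{p}" for p in BLOCKLIST if re.search(p, text)]
    return violations

def approve_tool_call(tool_name: str, confirmed_by_user: bool) -> bool:
    """High-blast-radius writes require explicit confirmation; reads pass through."""
    if tool_name in HIGH_RISK_TOOLS:
        return confirmed_by_user
    return True

# In the loop: run check_input() on user text and retrieved docs before they enter the
# prompt, and approve_tool_call() before executing any write tool.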

Human-in-the-loop triggers:

  • exceeding failure thresholds (retries, repeated tool errors),
  • high-risk actions (refunds, cancellations, prod changes).

Evaluation and observability

Evaluate agents as systems:

  • Task success rate on realistic workflows.
  • Tool-call accuracy: correct tool, correct args, correct sequencing.
  • Time/cost: wall-clock, tokens, tool calls.
  • Safety: disallowed tool attempts, policy violations, PII leakage.

Always log (a trace-record sketch follows):

  • prompt version, tools + args, tool outputs, stop reason, and user-visible outcome.
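
A sketch of a per-run trace record that captures the fields above; emit one per agent run so evals can replay failures. Field names are illustrative.

# tracing.py — one structured record per agent run (sketch).
import json
import time
import uuid
from dataclasses import asdict, dataclass, field

@dataclass
class ToolCallLog:
    name: str
    args: dict
    latency_s: float
    error: str | None = None

@dataclass
class RunTrace:
    run_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    prompt_version: str = "v0"
    tool_calls: list[ToolCallLog] = field(default_factory=list)
    stop_reason: str | None = None
    outcome: str | None = None            # user-visible result (or a pointer/hash of it)
    started_at: float = field(default_factory=time.time)

    def emit(self) -> str:
        return json.dumps(asdict(self), default=str)

# Usage: append a ToolCallLog after every tool call, set stop_reason/outcome at the end,
# then ship trace.emit() to your logging pipeline.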

Case study: toy news agent (fast_agent + fast_mcp)

Goal: “Given a topic, autonomously explore the news landscape, iteratively refine queries, verify claims against sources, and produce a cited brief with an explicit stop reason.”

High-level architecture

  • Agent client (CLI/UI) collects user intent and displays results.
  • Agent runtime (e.g., fast_agent) runs the loop: decide → call tools → synthesize.
  • LLM provider (e.g., OpenRouter) supplies the model used for reasoning + summarization.
  • MCP server (e.g., fast_mcp) exposes tools like news.search and news.fetch.

Note: Definitions (MCP + client + model routing)

agent client: the user-facing surface (CLI, web UI, Slack bot) that sends tasks to an agent runtime and renders results. It should not embed tool logic; keep that in the runtime/tooling layer.

MCP (Model Context Protocol): a standard protocol for advertising tools/resources to an agent runtime with consistent contracts. Interview intuition: it’s an integration layer that lets one client talk to many tools safely and consistently.

OpenRouter: a hosted API that routes a single OpenAI-compatible request format to many different model providers. Interview intuition: it’s a convenient way to swap models (or use multiple models) without rewriting your client.

Why an LLM “needs” MCP (interview framing)

  • The model itself doesn’t speak to your databases or APIs; an agent runtime does.
  • MCP makes “tool discovery + invocation” consistent across tools and across clients.
  • MCP centralizes governance: tool allowlists, permissions, and budgets can be enforced outside the model.

How MCP acts as an adapter

  • Agent client ↔︎ agent runtime: UX and session management.
  • Agent runtime ↔︎ LLM: completion/chat API (OpenAI-style, OpenRouter, etc.).
  • Agent runtime ↔︎ tools: MCP is the adapter layer that standardizes tool contracts and lets you register many tools behind one interface.

Toy toolset (registered on the MCP server)

  • news.search(query, since_days, limit) -> [{title, url, source, published_at}]
  • news.fetch(url) -> {url, title, text}
  • news.summarize(text, max_bullets) -> {bullets} (optional; many teams do summarization in the LLM instead)

Note: Definition (tool contract)

A tool contract is the strict input/output interface an agent runtime relies on (names, types, required fields, error semantics). It’s what makes tool calls testable and safe.

Important features of tool contracts (what interviewers look for; a typed sketch follows the list)

  • Deterministic outputs where possible (tools should not be “creative”).
  • Explicit types/shape: required vs optional fields; stable field names.
  • Clear failure semantics: structured errors, retryable vs non-retryable.
  • Timeouts and bounded work (tool must fail fast rather than hang).
  • Idempotency for writes (request IDs) and safe defaults.
  • Auth/permissions handled by the runtime, not the model.
  • Observability hooks: request IDs, tool name, latency, and minimal logging of sensitive fields.
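
A sketch of pinning the news.search contract down in types, including structured errors; TypedDicts are one option, JSON Schema or Pydantic are others.

# contracts.py — a typed contract for news.search (sketch).
from typing import Literal, TypedDict

class Article(TypedDict):
    title: str
    url: str
    source: str
    published_at: str          # ISO-8601; stable field names are part of the contract

class SearchResult(TypedDict):
    articles: list[Article]
    truncated: bool            # bounded results: caller knows when the limit was hit

class ToolError(TypedDict):
    code: Literal["timeout", "rate_limited", "bad_request", "upstream_error"]
    retryable: bool            # explicit retry semantics
    message: str

def validate_search_result(payload: dict) -> SearchResult:
    """Fail fast on malformed tool output instead of passing junk to the model."""
    for a in payload.get("articles", []):
        missing = {"title", "url", "source", "published_at"} - set(a)
        if missing:
            raise ValueError(f"article missing fields: {missing}")
    return payload  # type: ignore[return-value]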

Agentic loop (what you’d explain on a whiteboard)

  1. Client sends intent + constraints (topic, time window, preferred sources, max budget).
  2. Agent proposes a short research plan (subtopics, query variants, coverage targets).
  3. Agent runs iterative cycles:
    • call news.search with a query variant,
    • call news.fetch for the most promising sources,
    • extract candidate claims + attach citations,
    • assess coverage/novelty/credibility; decide what’s missing.
  4. Agent adapts: rewrite queries, broaden/narrow scope, or request clarification if constraints conflict.
  5. Stop when one condition is met: coverage threshold reached, marginal gains fall below a threshold, budget exhausted, or sources conflict (trigger “needs human review”).
  6. Return a structured brief (key findings + citations + confidence + stop reason).

Example skeleton (template, APIs vary by library)

MCP server (fast_mcp-style pseudocode):

# mcp_server.py

# Pseudocode: adjust imports/APIs to your fast_mcp version.

@tool(name="news.search")
def search(query: str, since_days: int = 7, limit: int = 5) -> list[dict]:
    """Return a list of candidate articles with URL + metadata."""
    ...


@tool(name="news.fetch")
def fetch(url: str) -> dict:
    """Fetch article text from a URL."""
    ...


def main():
    server = FastMCPServer(tools=[search, fetch])
    server.run()

Agent runtime (fast_agent-style pseudocode):

# agent_client.py

# Pseudocode: emphasize an adaptive loop (plan → act → observe → decide).
# Helpers like tool_calls_used() and update_state_with_*() stand in for app-specific code.

import os

mcp = MCPClient("http://localhost:8000")  # hypothetical MCP client wrapper
llm = OpenRouterChat(model="openai/gpt-4o-mini", api_key=os.environ["OPENROUTER_API_KEY"])

goal = {
    "topic": "AI agents",
    "since_days": 7,
    "max_sources": 12,
    "max_tool_calls": 25,
}

state = {
    "queries_tried": [],
    "sources": [],  # {url, title, source, published_at, text}
    "claims": [],   # {claim, citations:[url], confidence}
    "stop_reason": None,
}

while True:
    if len(state["sources"]) >= goal["max_sources"]:
        state["stop_reason"] = "source_budget_reached"
        break
    if tool_calls_used() >= goal["max_tool_calls"]:
        state["stop_reason"] = "tool_budget_reached"
        break

    plan = llm(
        "Propose the next best action (search/fetch), "
        "a query variant if needed, and what coverage is missing.",
        context={"goal": goal, "state": state},
    )

    if plan.action == "search":
        hits = mcp.call("news.search", query=plan.query, since_days=goal["since_days"], limit=10)
        state["queries_tried"].append(plan.query)
        state = update_state_with_hits(state, hits)

    elif plan.action == "fetch":
        doc = mcp.call("news.fetch", url=plan.url)
        state = update_state_with_doc(state, doc)

    elif plan.action == "stop":
        state["stop_reason"] = plan.reason  # e.g., "coverage_sufficient" / "sources_conflict"
        break

    # Extract/verify claims against collected sources (LLM-as-judge + strict citation rules).
    state["claims"] = llm(
        "Extract candidate claims and attach citations; "
        "if a claim lacks support, mark it as uncertain.",
        context={"sources": state["sources"]},
    )

brief = llm(
    "Write a structured briefing with citations, confidence, and stop_reason.",
    context={"goal": goal, "claims": state["claims"], "stop_reason": state["stop_reason"]},
)
print(brief)

Many other frameworks (including MCP-based stacks) support tool use in essentially the same way; this landscape is changing fast:

  • LangChain / LangGraph: large ecosystem, graph-based orchestration.
  • LlamaIndex: strong RAG + data connectors.
  • Semantic Kernel: enterprise-friendly agent + tool abstractions.
  • AutoGen / CrewAI: multi-agent coordination patterns.
  • Haystack: retrieval pipelines + production patterns.
  • DSPy / PydanticAI: programmatic prompting / typed structured outputs.
  • MCP ecosystem: tool servers + client runtimes that standardize tool wiring.

Interview drills (fast Q&A)

System design template (talk track)

One framework to answer most “Design an agent that does X” questions:

  • 1) Clarify X + constraints: what’s the input, output, latency/SLA, and what’s out of scope?
  • 2) Define success + risk metrics: task success rate, cost/latency, escalation rate, and domain-specific error rates (e.g., false approvals).
  • 3) Choose workflow vs agent: start with the simplest deterministic workflow that can hit the target; justify autonomy only for long-tail decisions.
  • 4) Specify tools + contracts: list read vs write tools, schemas, retries/timeouts, and idempotency for writes; enforce permissions/allowlists outside the model.
  • 5) Design state + context: what’s pinned (policy/tools), what’s retrieved (RAG), what’s in working memory, and what persists (notes/memory with provenance + TTL).
  • 6) Orchestration: single-agent loop first; add routing / planner–executor / evaluators / subagents only when it improves measured outcomes.
  • 7) Guardrails + human handoff: risk-rate tools/actions, require confirmation for high-blast-radius writes, and define clear escalation triggers.
  • 8) Evals + rollout: build replayable test sets, measure tool-call correctness + safety, ship in shadow/assist mode, then tighten budgets and monitoring.

“For the toy news agent, what should the tool contracts look like?”

  • news.search: strict schema, bounded results, stable IDs/URLs, and clear error semantics (retryable vs not).
  • news.fetch: deterministic extraction, timeouts, size limits, and normalized fields (title/source/published_at/text).
  • Prefer returning summaries + pointers over huge payloads; keep token cost predictable.

“How do you prevent ‘search → fetch everything’ tool spam?”

  • Budgets: max tool calls, max sources, max tokens per tool result; stop when marginal gain falls.
  • Planner step must emit an action + rationale + stop criteria; log stop_reason.
  • Add lightweight heuristics: dedupe by URL/domain, prefer primary sources, diversify outlets.

“How do you evaluate whether the agent is actually doing good retrieval?”

  • Retrieval metrics: recall@k / MRR / nDCG on a labeled set of “topic → relevant articles”.
  • Behavioral metrics: source diversity, freshness, redundancy rate, and citation coverage (claims with ≥2 independent sources).
  • End-to-end: briefing faithfulness (citations support claims), contradiction rate, and abstention quality.

“How do you handle context growth over a 25-tool-call loop?”

  • Treat context as an attention budget: keep raw tool outputs out of the prompt once digested.
  • Maintain structured state (queries tried, sources metadata, extracted claims) and compact periodically.
  • Retrieve just-in-time via references (URLs/IDs) rather than carrying full documents.

“Where do agents like this fail in practice, technically?”

  • Ambiguous tools or overlapping capabilities → wrong tool choice.
  • Context pollution: too many low-signal snippets → missed key facts (“context rot”).
  • Citation drift: model states claims without anchoring to source spans.
  • Latency/cost blowups: unconstrained fetching, reranking, or repeated query rewrites.

“How do you make this safe and robust?”

  • Treat retrieved text as data: isolate it, forbid it from overriding instructions, and sanitize HTML.
  • Enforce allowlists + permissions outside the model; risk-rate tools even if they’re read-only.
  • Add validators: schema checks, citation-required checks, and retries with explicit error messages.
  • Human-in-the-loop triggers: conflicting sources, low confidence, repeated tool errors, or budget exhaustion.

Skills

Claude Skills (Agent Skills) are reusable, composable packages of instructions + scripts + resources (files/folders) that equip an agent with domain-specific procedural knowledge—like writing an onboarding guide for a new hire.

Mental model

  • Skills = “procedural expertise” you install once instead of re-pasting long prompts.
  • Progressive disclosure keeps context lean: only lightweight metadata is always present; details load on-demand.
  • Skills can include deterministic scripts for steps where token-by-token generation is slow or unreliable.

How Skills work (high level)

  1. Startup: the agent pre-loads each skill’s name + description (from YAML frontmatter) into the system prompt.
  2. Relevance check: on a user task, the model decides whether to trigger a skill (similar to tool selection).
  3. On-demand load: when triggered, the agent reads the full SKILL.md into context.
  4. Deeper load: SKILL.md can reference additional files (e.g., forms.md, reference.md); the model reads them only if needed.
  5. Code execution (optional): skills may include scripts that the agent can run for deterministic steps, without needing to load the entire script (or large inputs) into the context window.

Skill folder structure

  • A skill is just a directory (e.g., changelog-newsletter/).
  • Required: SKILL.md at the root.
    • Starts with YAML frontmatter, e.g.:

      name: Changelog to Newsletter
      description: Convert git changelogs into a reader-friendly newsletter

    • Followed by Markdown instructions, checklists, examples, and links to other files.
  • Optional:
    • templates.md, examples.md, reference.md
    • scripts/ with *.py helpers for parsing/validation/formatting
    • resources/ with sample inputs/outputs

Progressive disclosure levels (useful interview phrasing)

  • Level 1: metadata only (name + description) — always present.
  • Level 2: full SKILL.md — loaded when the skill is triggered.
  • Level 3+: linked files/resources/scripts — loaded/executed only when relevant.

Skills vs prompting vs tools (interview answers)

  • Are Skills “prompt tech”?
    • Skills use prompting techniques (instructions, few-shot examples, templates), but they’re not just prompts: they add packaging, persistence, and on-demand loading.
  • Skills vs prompting:
    • Prompts are typically one-off or session-scoped; skills are versionable artifacts you can share across users/sessions and keep token-efficient via progressive disclosure.
  • Skills vs tool calls:
    • Tools provide the capability/action (“do X / fetch Y”); skills provide the playbook (“how to do X reliably”) and can teach consistent orchestration (plus optional scripts).

Skills + MCP (how they fit together)

  • MCP provides standardized access to tools/data sources (“connectivity layer”).
  • Skills provide the organization’s workflow, heuristics, and output formats (“operating procedure”).
  • Combined pattern: MCP exposes news.search/news.fetch, and a “News Briefing” skill defines query strategies, citation rules, budgets/stop reasons, and validation steps.

Best practices

  • Keep skills narrow and composable; split big domains into multiple skills.
  • Write descriptions so triggering is reliable: when to use + what output to produce.
  • Prefer canonical examples over exhaustive edge-case lists.
  • Start with evaluation: observe where the agent fails on representative tasks, then build skills incrementally to patch those gaps.
  • Audit any scripts/resources (treat as production code): least privilege, safe defaults, and clear inputs/outputs.

Security considerations

  • Install skills only from trusted sources; audit instructions, scripts, and bundled resources.
  • Watch for instructions or dependencies that could enable unsafe actions or data exfiltration.

RAG (Retrieval-Augmented Generation)

RAG retrieves relevant context from a corpus and then generates an answer conditioned on that retrieved context.

Baseline pipeline (a minimal end-to-end sketch follows the list)

  1. Ingest: parse docs, extract text, attach metadata.
  2. Chunk: split into retrieval units (preserve structure and headings).
  3. Embed: generate vectors for chunks (dense, sparse, or both).
  4. Index: store vectors + metadata.
  5. Retrieve: top-k candidates for a query.
  6. Rerank (optional): improve ordering using a stronger model.
  7. Generate: answer with citations and constraints.
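
A minimal end-to-end sketch of the pipeline using dense retrieval only. It assumes an `embed(texts) -> vectors` function from whatever embedding API you use; chunking, reranking, and citations are stripped to the skeleton.

# rag_min.py — ingest → chunk → embed → retrieve (sketch).
import numpy as np

def chunk(text: str, max_chars: int = 800) -> list[str]:
    """Naive paragraph-aligned chunking; real systems split on headings/structure."""
    parts, buf = [], ""
    for para in text.split("\n\n"):
        if len(buf) + len(para) > max_chars and buf:
            parts.append(buf.strip())
            buf = ""
        buf += para + "\n\n"
    if buf.strip():
        parts.append(buf.strip())
    return parts

def build_index(embed, docs: dict[str, str]) -> dict:
    chunks, meta = [], []
    for doc_id, text in docs.items():
        for i, c in enumerate(chunk(text)):
            chunks.append(c)
            meta.append({"doc_id": doc_id, "chunk_id": i})
    vecs = np.asarray(embed(chunks), dtype=np.float32)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)       # normalize for cosine similarity
    return {"vectors": vecs, "chunks": chunks, "meta": meta}

def retrieve(embed, index: dict, query: str, k: int = 5):
    q = np.asarray(embed([query]), dtype=np.float32)[0]
    q /= np.linalg.norm(q)
    scores = index["vectors"] @ q
    top = np.argsort(-scores)[:k]
    return [(index["chunks"][i], index["meta"][i], float(scores[i])) for i in top]

# Generation step: pass the top-k chunks as quoted, numbered sources and require citations.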

Chunking (the lever people underestimate)

  • Too small: loses context; retrieval gets noisy.
  • Too big: irrelevant text dilutes signal and increases tokens.
  • Align splits to structure (headings, paragraphs, code blocks).
  • Preserve metadata: title, section, URL/path, timestamp, permissions.

Retrieval

  • Dense (embeddings): semantic similarity.
  • Sparse (BM25): keyword matching.
  • Hybrid: combine dense + sparse for robustness (a fusion sketch follows the BM25 note).
  • Use metadata filters (product, date, language, access scope).

Note: Definition (BM25)

BM25 is a classic keyword-based ranking function used in sparse retrieval; it’s strong when exact terms matter (IDs, error codes, product names) and complements embeddings.
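
One low-risk way to combine sparse and dense results is reciprocal rank fusion (RRF): merge the two ranked lists by rank rather than raw score, which avoids calibrating BM25 scores against cosine similarities. A sketch:

# hybrid_rrf.py — reciprocal rank fusion of sparse + dense rankings (sketch).
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """rankings: each element is a list of doc IDs, best first.
    score(d) = sum over lists of 1 / (k + rank_of_d); unranked docs contribute 0."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fused = rrf([bm25_ranked_ids, dense_ranked_ids])[:top_k]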

Reranking

  • Cross-encoder rerankers are strong but add latency.
  • Heuristic reranking helps (freshness, authority, doc type).

Note: Definition (cross-encoder reranker)

A cross-encoder scores a (query, document) pair together (full attention across both), which usually improves ranking quality vs embedding similarity, but increases latency/cost.
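
A sketch of cross-encoder reranking, assuming the sentence-transformers library and one of its public MS MARCO cross-encoder checkpoints; swap in whatever reranker your stack uses.

# rerank.py — cross-encoder reranking of retrieved candidates (sketch).
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # public checkpoint

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    scores = reranker.predict([(query, c) for c in candidates])   # joint (query, doc) scoring
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_k]]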

Generation and citations

  • Provide retrieved passages as quoted context; require citations (prompt sketch after this list).
  • Enforce constraints: “If not in sources, say you don’t know.”
  • Consider a two-step: draft → verify against citations.
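
A sketch of assembling the generation prompt with numbered, quoted sources and an explicit abstention rule; the JSON output contract is illustrative.

# cited_answer.py — build a citation-constrained generation prompt (sketch).
def build_prompt(question: str, passages: list[dict]) -> str:
    """passages: [{"id": 1, "url": ..., "text": ...}, ...] from the retriever."""
    sources = "\n\n".join(f"[{p['id']}] ({p['url']})\n\"{p['text']}\"" for p in passages)
    return (
        "Answer using ONLY the sources below. Cite sources as [id] after each claim.\n"
        "If the answer is not in the sources, reply exactly: "
        "\"I don't know based on the provided sources.\"\n"
        "Return JSON: {\"answer\": str, \"citations\": [int]}.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )

# Two-step option: generate a draft, then a second call that checks each claim against
# its cited source and flags unsupported ones.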

Common failure modes

  • good retrieval, bad generation (model ignores context),
  • bad retrieval (embedding mismatch, chunking issues, no hybrid),
  • prompt injection from retrieved docs (treat as data),
  • stale indexes (no incremental ingestion or TTL).

Evals and operations

  • Retrieval: recall@k, MRR, nDCG.
  • Answers: faithfulness, completeness, helpfulness (human eval for ambiguity).
  • Observability: log doc IDs, scores, chunks shown, rerank results.
  • Cost controls: context budgets, caching embeddings/retrieval, summarization.

Note: Definitions (recall@k / MRR / nDCG)

recall@k: whether at least one relevant chunk/doc appears in the top k retrieved results.

MRR (Mean Reciprocal Rank): emphasizes ranking the first relevant result near the top (higher when the first relevant item appears earlier).

nDCG (normalized Discounted Cumulative Gain): measures ranking quality when relevance can be graded (very relevant vs somewhat relevant), rewarding good ordering.
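
The definitions above reduce to a few lines of code. A sketch over ranked lists of doc IDs, with graded relevance for nDCG; the label formats (sets of relevant IDs, per-doc gain dicts) are assumptions about how your eval set is stored.

# retrieval_metrics.py — recall@k, MRR, nDCG over ranked doc IDs (sketch).
import math

def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """1.0 if at least one relevant doc appears in the top k, else 0.0."""
    return 1.0 if any(d in relevant for d in ranked[:k]) else 0.0

def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0            # average this over queries to get MRR

def ndcg_at_k(ranked: list[str], gains: dict[str, float], k: int) -> float:
    """gains: graded relevance per doc ID (e.g., 2 = very relevant, 1 = somewhat)."""
    dcg = sum(gains.get(d, 0.0) / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0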

Enhancements (practical)

  • query rewriting,
  • multi-hop retrieval,
  • self-ask (subquestions per hop),
  • structured retrieval (tables/JSON extraction before generation).
