6. Applications: Agents and RAG

Overview

This chapter is interview-prep study notes for applying LLMs to agents, RAG, and prompting.

Analogy: think of an agent as a new hire operating a computer—tools are what it can touch (APIs/actions), and skills are the onboarding playbooks it can load on demand (checklists, templates, scripts). You build capable agents by giving them a small set of reliable tools, a library of skills, and a controller loop that decides what to load/use next.

What you’ll learn:

  • How to decide workflow vs agent, and when autonomy is worth the variance.
  • How to design tools (and optionally MCP) so agents are reliable in production.
  • How to think about guardrails, evaluation, and observability as the real sources of reliability.
  • A practical RAG pipeline and how to evaluate retrieval (recall@k, MRR, nDCG).
  • Prompting patterns that scale: output contracts, validation, and failure-mode thinking.

Agents

What is an agent?

An agent is a system that can plan and act: it iterates over (1) understanding state, (2) selecting an action (often a tool call), (3) observing results, and (4) updating state until a goal is met or it stops.

Useful interview distinction:

  • Workflow: LLM + tools orchestrated through predefined code paths (predictable, testable).
  • Agent: LLM dynamically directs its own steps and tool usage (flexible, higher variance).

When should you build an agent?

Agents fit workflows that resist deterministic automation:

  • Complex decision-making: exceptions, judgment, context-sensitive decisions.
  • Brittle rules: rule engines that are expensive to maintain.
  • Unstructured data: email threads, docs, tickets, PDFs, free-form user text.

If a deterministic pipeline or a single LLM call + retrieval can meet the target, prefer that.

Common agent products (examples)

  • Enterprise search across systems: permissions-aware retrieval + citation-first answers.
  • Support triage: routing + tool use + human handoff on edge cases.
  • Deep research briefs: planner–executor + multi-hop retrieval + synthesis.
  • Coding/debugging agent: ReAct loop with repo search + tests as verification.
  • Ops automations (refunds/cancellations): write tools with idempotency + confirmations.

Single-agent design foundations (what to whiteboard)

In its simplest form:

  1. Model: the LLM used for reasoning and decisions.
  2. Tools: functions/APIs that read data or take actions.
  3. Instructions: routines, constraints, and output contracts.
  4. Controller loop: run until a stop condition.

Typical stop conditions:

  • model emits a final structured output,
  • model responds with no tool calls,
  • a tool errors or times out,
  • max turns or max tool calls.

Minimal agent loop (mental model)

turns = 0
while not done(state) and turns < max_turns:      # max turns/tool calls is a stop condition
    decision = llm(state, context)                # model picks: tool call or final answer
    if decision.tool_call:
        result = call_tool(decision.tool_call)    # execute the chosen tool with its args
        state = update(state, result)             # fold the observation back into state
        turns += 1
    else:
        return decision.response                  # no tool call => final output
return finalize(state)                            # budget exhausted: return best-effort state

Tools and MCP (what interviewers care about)

Treat tools as an agent-computer interface (ACI).

Note: Definitions (tool reliability)

MCP (Model Context Protocol): a standardized way to expose tools/resources to an agent with consistent schemas; interview intuition: “a tool integration layer,” not a safety boundary.

Idempotency: make write actions safe to retry (same request ID → same effect, no duplicate refunds/emails).

Allowlist: an explicit list of permitted tools/actions enforced by your application (never delegated to the model).

Tooling checklist (a minimal wrapper sketch follows the list):

  • Schemas: explicit parameters and return types; validate inputs/outputs.
  • Naming: make tool names and arguments obvious; avoid overlapping tools.
  • Timeouts/retries: retries with backoff; classify transient vs permanent errors.
  • Idempotency: for write actions, add request IDs and make retries safe.
  • Observability: log tool name, args, latency, errors, and results.
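
To make the checklist concrete, here is a minimal sketch of a write-tool wrapper with timeouts, retries with backoff, an idempotency key, and logging. The endpoint, tool name, and field names are hypothetical; adjust to your stack.

# tool_call_wrapper.py — sketch only; endpoint and field names are hypothetical.
import json
import logging
import time
import urllib.error
import urllib.request
import uuid

log = logging.getLogger("tools")

def call_refund_tool(order_id: str, amount_cents: int,
                     request_id: str | None = None,
                     max_retries: int = 3, timeout_s: float = 5.0) -> dict:
    """Write tool with an idempotency key: retrying the same request_id must not
    issue a second refund (enforced server-side)."""
    request_id = request_id or str(uuid.uuid4())
    payload = json.dumps({"order_id": order_id, "amount_cents": amount_cents,
                          "request_id": request_id}).encode()
    req = urllib.request.Request("https://example.internal/refunds",   # hypothetical endpoint
                                 data=payload,
                                 headers={"Content-Type": "application/json"})
    for attempt in range(1, max_retries + 1):
        start = time.monotonic()
        try:
            with urllib.request.urlopen(req, timeout=timeout_s) as resp:
                result = json.loads(resp.read())
            log.info("tool=refund request_id=%s latency=%.2fs ok",
                     request_id, time.monotonic() - start)
            return result
        except urllib.error.HTTPError as e:
            if e.code < 500:                        # permanent error: don't retry
                log.error("tool=refund permanent error %s", e.code)
                raise
            log.warning("tool=refund transient error %s (attempt %d)", e.code, attempt)
        except (urllib.error.URLError, TimeoutError) as e:
            log.warning("tool=refund timeout/network error %s (attempt %d)", e, attempt)
        time.sleep(2 ** attempt)                    # exponential backoff between retries
    raise RuntimeError(f"refund failed after {max_retries} attempts (request_id={request_id})")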

If using MCP (an enforcement sketch follows the list):

  • keep a registry of tools + JSON schemas,
  • enforce budgets (max calls, max latency),
  • implement allowlists/permissions outside the model.
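
A sketch of enforcing allowlists and call budgets in the application, not the model; `mcp_client.call` stands in for whatever MCP client API you use.

# tool_gate.py — allowlist + budget enforcement in the runtime (sketch).
from dataclasses import dataclass, field

@dataclass
class ToolBudget:
    max_calls: int = 25
    calls_used: int = 0

@dataclass
class ToolGate:
    allowlist: set[str]                        # permitted tool names, decided by the app
    budget: ToolBudget = field(default_factory=ToolBudget)

    def call(self, mcp_client, tool_name: str, **args):
        if tool_name not in self.allowlist:
            raise PermissionError(f"tool '{tool_name}' is not on the allowlist")
        if self.budget.calls_used >= self.budget.max_calls:
            raise RuntimeError("tool-call budget exhausted")
        self.budget.calls_used += 1
        return mcp_client.call(tool_name, **args)   # hypothetical MCP client method

# Usage: gate = ToolGate(allowlist={"news.search", "news.fetch"})
#        hits = gate.call(mcp, "news.search", query="AI agents", limit=10)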

Context and Memory Management

Context = what you pass to the model now.

How do you do effective context engineering for agents?

  • Prompt engineering → context engineering: beyond writing instructions, you’re curating the full token set each turn (instructions, tools, message history, retrieved data, tool outputs).
  • Context is a finite “attention budget”: longer contexts can degrade recall/precision (“context rot”); treat tokens as scarce resources with diminishing returns.
  • Guiding principle: aim for the smallest high-signal context that reliably produces the desired behavior.
  • System prompt altitude (“Goldilocks”): avoid brittle pseudo-code / if-else logic in prompts and avoid vague guidance; be clear, direct, and structured.
  • Structure helps: organize instructions into explicit sections (e.g., background, constraints, tool guidance, output spec). This makes it easier to maintain and to debug failures.
  • Examples beat rule-lists: prefer a small set of diverse, canonical few-shot examples over “laundry lists” of edge cases.
  • Tools are context shapers: design tool contracts for token efficiency and unambiguous choice; bloated/overlapping toolsets increase confusion and waste context.
  • Retrieval strategy shift: complement precomputed retrieval (embeddings/RAG) with just-in-time retrieval where the agent keeps lightweight references (paths/URLs/IDs) and pulls details only when needed (see the sketch after this list).
  • Progressive disclosure: let the agent explore incrementally; metadata (names, folder structure, timestamps) provides useful signals without loading entire artifacts.
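
A minimal sketch of just-in-time retrieval and progressive disclosure: the agent always carries cheap references (path plus a one-line preview) and loads a file's full text only when the model decides it matters. The directory layout is a placeholder.

# jit_context.py — keep references in context, load bodies on demand (sketch).
from pathlib import Path

def list_references(root: str) -> list[dict]:
    """Cheap metadata the agent can always keep in context."""
    refs = []
    for p in Path(root).rglob("*.md"):
        first_line = p.read_text(encoding="utf-8").splitlines()[:1]
        refs.append({"path": str(p), "preview": first_line[0] if first_line else ""})
    return refs

def load_reference(path: str, max_chars: int = 8_000) -> str:
    """Loaded only when the model asks for this specific document."""
    return Path(path).read_text(encoding="utf-8")[:max_chars]

# The prompt gets list_references("docs/") every turn; load_reference() is exposed
# as a tool the model calls when a specific document is actually needed.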

Long-horizon techniques (when the task outgrows the context window)

  • Compaction: periodically summarize and restart with a high-fidelity condensed state (keep decisions, open issues, next steps; drop redundant chatter) — see the sketch after this list.

  • Tool-result clearing: a low-risk compaction move—once a tool result is “digested,” avoid re-including the raw payload deep in history.

  • Structured note-taking (agentic memory): persist notes outside the context window (e.g., TODOs, decisions, constraints) and rehydrate them as needed.

  • Sub-agent architectures: delegate deep exploration to subagents with clean contexts; have them return distilled summaries for the main agent.

A practical way to partition what the window holds at any point:

  • Pinned context: system policy + tool schemas.

  • Working set: recent turns + latest tool results.

  • Retrieved context: RAG results pulled on-demand.

  • Summaries: compress long traces into structured state.
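
A sketch of compaction plus structured note-taking: summarize the transcript into durable notes, keep decisions/open issues/next steps, and drop raw tool payloads. It assumes `llm(prompt, context=...)` returns the JSON string requested, as in the runtime sketch later in this chapter.

# compaction.py — compress a long trace into structured state (sketch).
import json

COMPACTION_PROMPT = (
    "Summarize this agent transcript as JSON with keys: "
    "'decisions', 'open_issues', 'next_steps', 'key_facts'. "
    "Keep citations/IDs; drop raw tool payloads and chit-chat."
)

def compact(llm, transcript: list[dict], notes: dict) -> tuple[list[dict], dict]:
    """Return a fresh, short transcript plus persistent notes."""
    summary = llm(COMPACTION_PROMPT, context={"transcript": transcript})  # assumed to return JSON text
    notes = {**notes, **json.loads(summary)}            # merge into durable notes
    new_transcript = [{"role": "system",
                       "content": "Compacted state: " + json.dumps(notes)}]
    return new_transcript, notes

# Call compact() when the transcript approaches the context budget;
# raw tool results live in logs, not in the prompt.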

Memory = what persists across sessions.

  • Short-term: current conversation state.
  • Long-term: durable facts/preferences with provenance and TTL (sketched below).
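
A minimal sketch of a long-term memory record with provenance and TTL; field names are illustrative.

# memory.py — durable facts with provenance and expiry (sketch).
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class MemoryItem:
    fact: str                 # e.g., "user prefers weekly digests"
    source: str               # provenance: conversation ID, doc URL, or tool call
    created_at: datetime      # store timezone-aware timestamps
    ttl_days: int = 90        # expire stale facts instead of trusting them forever

    def expired(self, now: datetime | None = None) -> bool:
        now = now or datetime.now(timezone.utc)
        return now - self.created_at > timedelta(days=self.ttl_days)

def rehydrate(items: list[MemoryItem]) -> list[str]:
    """Only non-expired facts get injected back into context."""
    return [m.fact for m in items if not m.expired()]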

Common pitfalls:

  • mixing instructions with retrieved data (prompt injection risk),
  • dumping raw logs into context (token blow-up),
  • storing PII without consent.

Orchestration patterns (single + multi-agent)

Start simple and add structure only when it improves measurable outcomes.

Workflows (predefined orchestration paths):

  • Prompt chaining: step-by-step decomposition + gates/checks.
  • Routing: classify input and send to specialized prompts/models (sketched after this list).
  • Parallelization: sectioning (subtasks) or voting (multiple attempts).
  • Orchestrator–workers: manager decomposes dynamically, workers execute, manager synthesizes.
  • Evaluator–optimizer: generate → critique → refine against explicit criteria.
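
To ground one of these, a sketch of the routing pattern: a cheap classification step picks a specialized prompt/model, keeping each path simple and testable. The route names, model names, and the `llm(prompt, context=..., model=...)` signature are placeholders.

# routing.py — classify, then dispatch to a specialized path (sketch).
ROUTES = {
    "billing":   {"model": "small-model", "prompt": "You are a billing assistant..."},
    "technical": {"model": "large-model", "prompt": "You are a debugging assistant..."},
    "other":     {"model": "small-model", "prompt": "You are a general assistant..."},
}

def route(llm, user_message: str) -> str:
    label = llm(
        "Classify the message into one of: billing, technical, other. "
        "Answer with the label only.",
        context={"message": user_message},
    ).strip().lower()
    return label if label in ROUTES else "other"       # fall back on unknown labels

def handle(llm, user_message: str) -> str:
    cfg = ROUTES[route(llm, user_message)]
    return llm(cfg["prompt"], context={"message": user_message}, model=cfg["model"])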

Agentic systems:

  • Single-agent loop: one agent + tools + instructions.
  • Manager (agents-as-tools): one agent owns the user interaction and delegates.
  • Decentralized handoffs: agents transfer control peer-to-peer (good for triage).

Guardrails (layered defenses)

Guardrails should combine multiple layers (a layered-check sketch follows the list):

  • Relevance: keep the agent on-scope.
  • Safety/jailbreak/injection detection.
  • PII filtering and redaction at boundaries.
  • Moderation for harmful content.
  • Rules-based checks: regex/blocklists, input limits, schema validation.
  • Tool safeguards: risk-rate tools (read-only vs write, reversibility, $ impact) and require extra checks or user confirmation for high-risk actions.
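
A sketch of layering cheap rules-based checks with tool risk rating; the patterns, thresholds, and high-risk tool names are illustrative.

# guardrails.py — layered input checks + risk-rated tool gating (sketch).
import re

BLOCKLIST = [r"(?i)ignore (all|previous) instructions",   # crude injection signal
             r"\b\d{3}-\d{2}-\d{4}\b"]                     # SSN-like pattern (PII)
MAX_INPUT_CHARS = 8_000

HIGH_RISK_TOOLS = {"refunds.create", "subscriptions.cancel"}   # hypothetical write tools

def check_input(text: str) -> list[str]:
    violations = []
    if len(text) > MAX_INPUT_CHARS:
        violations.append("input_too_long")
    violations += [f"pattern:{p}" for p in BLOCKLIST if re.search(p, text)]
    return violations

def approve_tool_call(tool_name: str, confirmed_by_user: bool) -> bool:
    """High-blast-radius writes require explicit confirmation; reads pass through."""
    if tool_name in HIGH_RISK_TOOLS:
        return confirmed_by_user
    return True

# In the loop: run check_input() on user text and retrieved docs before they enter the
# prompt, and approve_tool_call() before executing any write tool.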

Human-in-the-loop triggers:

  • exceeding failure thresholds (retries, repeated tool errors),
  • high-risk actions (refunds, cancellations, prod changes).

Evaluation and observability

Evaluate agents as systems:

  • Task success rate on realistic workflows.
  • Tool-call accuracy: correct tool, correct args, correct sequencing.
  • Time/cost: wall-clock, tokens, tool calls.
  • Safety: disallowed tool attempts, policy violations, PII leakage.

Always log (a trace-record sketch follows):

  • prompt version, tools + args, tool outputs, stop reason, and user-visible outcome.
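
A sketch of a per-run trace record that captures the fields above; emit one per agent run so evals can replay failures. Field names are illustrative.

# tracing.py — one structured record per agent run (sketch).
import json
import time
import uuid
from dataclasses import asdict, dataclass, field

@dataclass
class ToolCallLog:
    name: str
    args: dict
    latency_s: float
    error: str | None = None

@dataclass
class RunTrace:
    run_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    prompt_version: str = "v0"
    tool_calls: list[ToolCallLog] = field(default_factory=list)
    stop_reason: str | None = None
    outcome: str | None = None            # user-visible result (or a pointer/hash of it)
    started_at: float = field(default_factory=time.time)

    def emit(self) -> str:
        return json.dumps(asdict(self), default=str)

# Usage: append a ToolCallLog after every tool call, set stop_reason/outcome at the end,
# then ship trace.emit() to your logging pipeline.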

Case study: toy news agent (fast_agent + fast_mcp)

Goal: “Given a topic, autonomously explore the news landscape, iteratively refine queries, verify claims against sources, and produce a cited brief with an explicit stop reason.”

High-level architecture

  • Agent client (CLI/UI) collects user intent and displays results.
  • Agent runtime (e.g., fast_agent) runs the loop: decide → call tools → synthesize.
  • LLM provider (e.g., OpenRouter) supplies the model used for reasoning + summarization.
  • MCP server (e.g., fast_mcp) exposes tools like news.search and news.fetch.

Note: Definitions (MCP + client + model routing)

agent client: the user-facing surface (CLI, web UI, Slack bot) that sends tasks to an agent runtime and renders results. It should not embed tool logic; keep that in the runtime/tooling layer.

MCP (Model Context Protocol): a standard protocol for advertising tools/resources to an agent runtime with consistent contracts. Interview intuition: it’s an integration layer that lets one client talk to many tools safely and consistently.

OpenRouter: a hosted API that routes a single OpenAI-compatible request format to many different model providers. Interview intuition: it’s a convenient way to swap models (or use multiple models) without rewriting your client.

Why an LLM “needs” MCP (interview framing)

  • The model itself doesn’t speak to your databases or APIs; an agent runtime does.
  • MCP makes “tool discovery + invocation” consistent across tools and across clients.
  • MCP centralizes governance: tool allowlists, permissions, and budgets can be enforced outside the model.

How MCP acts as an adapter

  • Agent client ↔︎ agent runtime: UX and session management.
  • Agent runtime ↔︎ LLM: completion/chat API (OpenAI-style, OpenRouter, etc.).
  • Agent runtime ↔︎ tools: MCP is the adapter layer that standardizes tool contracts and lets you register many tools behind one interface.

Toy toolset (registered on the MCP server)

  • news.search(query, since_days, limit) -> [{title, url, source, published_at}]
  • news.fetch(url) -> {url, title, text}
  • news.summarize(text, max_bullets) -> {bullets} (optional; many teams do summarization in the LLM instead)

Note: Definition (tool contract)

A tool contract is the strict input/output interface an agent runtime relies on (names, types, required fields, error semantics). It’s what makes tool calls testable and safe.

Important features of tool contracts (what interviewers look for; a typed sketch follows the list)

  • Deterministic outputs where possible (tools should not be “creative”).
  • Explicit types/shape: required vs optional fields; stable field names.
  • Clear failure semantics: structured errors, retryable vs non-retryable.
  • Timeouts and bounded work (tool must fail fast rather than hang).
  • Idempotency for writes (request IDs) and safe defaults.
  • Auth/permissions handled by the runtime, not the model.
  • Observability hooks: request IDs, tool name, latency, and minimal logging of sensitive fields.
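
A sketch of pinning the news.search contract down in types, including structured errors; TypedDicts are one option, JSON Schema or Pydantic are others.

# contracts.py — a typed contract for news.search (sketch).
from typing import Literal, TypedDict

class Article(TypedDict):
    title: str
    url: str
    source: str
    published_at: str          # ISO-8601; stable field names are part of the contract

class SearchResult(TypedDict):
    articles: list[Article]
    truncated: bool            # bounded results: caller knows when the limit was hit

class ToolError(TypedDict):
    code: Literal["timeout", "rate_limited", "bad_request", "upstream_error"]
    retryable: bool            # explicit retry semantics
    message: str

def validate_search_result(payload: dict) -> SearchResult:
    """Fail fast on malformed tool output instead of passing junk to the model."""
    for a in payload.get("articles", []):
        missing = {"title", "url", "source", "published_at"} - set(a)
        if missing:
            raise ValueError(f"article missing fields: {missing}")
    return payload  # type: ignore[return-value]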

Agentic loop (what you’d explain on a whiteboard)

  1. Client sends intent + constraints (topic, time window, preferred sources, max budget).
  2. Agent proposes a short research plan (subtopics, query variants, coverage targets).
  3. Agent runs iterative cycles:
    • call news.search with a query variant,
    • call news.fetch for the most promising sources,
    • extract candidate claims + attach citations,
    • assess coverage/novelty/credibility; decide what’s missing.
  4. Agent adapts: rewrite queries, broaden/narrow scope, or request clarification if constraints conflict.
  5. Stop when one condition is met: coverage threshold reached, marginal gains fall below a threshold, budget exhausted, or sources conflict (trigger “needs human review”).
  6. Return a structured brief (key findings + citations + confidence + stop reason).

Example skeleton (template, APIs vary by library)

MCP server (fast_mcp-style pseudocode):

# mcp_server.py

# Pseudocode: adjust imports/APIs to your fast_mcp version.

@tool(name="news.search")
def search(query: str, since_days: int = 7, limit: int = 5) -> list[dict]:
    """Return a list of candidate articles with URL + metadata."""
    ...


@tool(name="news.fetch")
def fetch(url: str) -> dict:
    """Fetch article text from a URL."""
    ...


def main():
    server = FastMCPServer(tools=[search, fetch])
    server.run()

Agent runtime (fast_agent-style pseudocode):

# agent_client.py

# Pseudocode: emphasize an adaptive loop (plan → act → observe → decide).
# Helpers like tool_calls_used() and update_state_with_*() stand in for app-specific code.

import os

mcp = MCPClient("http://localhost:8000")  # hypothetical MCP client wrapper
llm = OpenRouterChat(model="openai/gpt-4o-mini", api_key=os.environ["OPENROUTER_API_KEY"])

goal = {
    "topic": "AI agents",
    "since_days": 7,
    "max_sources": 12,
    "max_tool_calls": 25,
}

state = {
    "queries_tried": [],
    "sources": [],  # {url, title, source, published_at, text}
    "claims": [],   # {claim, citations:[url], confidence}
    "stop_reason": None,
}

while True:
    if len(state["sources"]) >= goal["max_sources"]:
        state["stop_reason"] = "source_budget_reached"
        break
    if tool_calls_used() >= goal["max_tool_calls"]:
        state["stop_reason"] = "tool_budget_reached"
        break

    plan = llm(
        "Propose the next best action (search/fetch), "
        "a query variant if needed, and what coverage is missing.",
        context={"goal": goal, "state": state},
    )

    if plan.action == "search":
        hits = mcp.call("news.search", query=plan.query, since_days=goal["since_days"], limit=10)
        state["queries_tried"].append(plan.query)
        state = update_state_with_hits(state, hits)

    elif plan.action == "fetch":
        doc = mcp.call("news.fetch", url=plan.url)
        state = update_state_with_doc(state, doc)

    elif plan.action == "stop":
        state["stop_reason"] = plan.reason  # e.g., "coverage_sufficient" / "sources_conflict"
        break

    # Extract/verify claims against collected sources (LLM-as-judge + strict citation rules).
    state["claims"] = llm(
        "Extract candidate claims and attach citations; "
        "if a claim lacks support, mark it as uncertain.",
        context={"sources": state["sources"]},
    )

brief = llm(
    "Write a structured briefing with citations, confidence, and stop_reason.",
    context={"goal": goal, "claims": state["claims"], "stop_reason": state["stop_reason"]},
)
print(brief)

Many other frameworks (including MCP-based stacks) support tool use in essentially the same way; this landscape is changing fast:

  • LangChain / LangGraph: large ecosystem, graph-based orchestration.
  • LlamaIndex: strong RAG + data connectors.
  • Semantic Kernel: enterprise-friendly agent + tool abstractions.
  • AutoGen / CrewAI: multi-agent coordination patterns.
  • Haystack: retrieval pipelines + production patterns.
  • DSPy / PydanticAI: programmatic prompting / typed structured outputs.
  • MCP ecosystem: tool servers + client runtimes that standardize tool wiring.

Interview drills (fast Q&A)

System design template (talk track)

One framework to answer most “Design an agent that does X” questions:

  • 1) Clarify X + constraints: what’s the input, output, latency/SLA, and what’s out of scope?
  • 2) Define success + risk metrics: task success rate, cost/latency, escalation rate, and domain-specific error rates (e.g., false approvals).
  • 3) Choose workflow vs agent: start with the simplest deterministic workflow that can hit the target; justify autonomy only for long-tail decisions.
  • 4) Specify tools + contracts: list read vs write tools, schemas, retries/timeouts, and idempotency for writes; enforce permissions/allowlists outside the model.
  • 5) Design state + context: what’s pinned (policy/tools), what’s retrieved (RAG), what’s in working memory, and what persists (notes/memory with provenance + TTL).
  • 6) Orchestration: single-agent loop first; add routing / planner–executor / evaluators / subagents only when it improves measured outcomes.
  • 7) Guardrails + human handoff: risk-rate tools/actions, require confirmation for high-blast-radius writes, and define clear escalation triggers.
  • 8) Evals + rollout: build replayable test sets, measure tool-call correctness + safety, ship in shadow/assist mode, then tighten budgets and monitoring.

“For the toy news agent, what should the tool contracts look like?”

  • news.search: strict schema, bounded results, stable IDs/URLs, and clear error semantics (retryable vs not).
  • news.fetch: deterministic extraction, timeouts, size limits, and normalized fields (title/source/published_at/text).
  • Prefer returning summaries + pointers over huge payloads; keep token cost predictable.

“How do you prevent ‘search → fetch everything’ tool spam?”

  • Budgets: max tool calls, max sources, max tokens per tool result; stop when marginal gain falls.
  • Planner step must emit an action + rationale + stop criteria; log stop_reason.
  • Add lightweight heuristics: dedupe by URL/domain, prefer primary sources, diversify outlets.

“How do you evaluate whether the agent is actually doing good retrieval?”

  • Retrieval metrics: recall@k / MRR / nDCG on a labeled set of “topic → relevant articles”.
  • Behavioral metrics: source diversity, freshness, redundancy rate, and citation coverage (claims with ≥2 independent sources).
  • End-to-end: briefing faithfulness (citations support claims), contradiction rate, and abstention quality.

“How do you handle context growth over a 25-tool-call loop?”

  • Treat context as an attention budget: keep raw tool outputs out of the prompt once digested.
  • Maintain structured state (queries tried, sources metadata, extracted claims) and compact periodically.
  • Retrieve just-in-time via references (URLs/IDs) rather than carrying full documents.

“Where do agents like this fail in practice, technically?”

  • Ambiguous tools or overlapping capabilities → wrong tool choice.
  • Context pollution: too many low-signal snippets → missed key facts (“context rot”).
  • Citation drift: model states claims without anchoring to source spans.
  • Latency/cost blowups: unconstrained fetching, reranking, or repeated query rewrites.

“How do you make this safe and robust?”

  • Treat retrieved text as data: isolate it, forbid it from overriding instructions, and sanitize HTML.
  • Enforce allowlists + permissions outside the model; risk-rate tools even if they’re read-only.
  • Add validators: schema checks, citation-required checks, and retries with explicit error messages.
  • Human-in-the-loop triggers: conflicting sources, low confidence, repeated tool errors, or budget exhaustion.

Skills

Claude Skills (Agent Skills) are reusable, composable packages of instructions + scripts + resources (files/folders) that equip an agent with domain-specific procedural knowledge—like writing an onboarding guide for a new hire.

Mental model

  • Skills = “procedural expertise” you install once instead of re-pasting long prompts.
  • Progressive disclosure keeps context lean: only lightweight metadata is always present; details load on-demand.
  • Skills can include deterministic scripts for steps where token-by-token generation is slow or unreliable.

How Skills work (high level)

  1. Startup: the agent pre-loads each skill’s name + description (from YAML frontmatter) into the system prompt.
  2. Relevance check: on a user task, the model decides whether to trigger a skill (similar to tool selection).
  3. On-demand load: when triggered, the agent reads the full SKILL.md into context.
  4. Deeper load: SKILL.md can reference additional files (e.g., forms.md, reference.md); the model reads them only if needed.
  5. Code execution (optional): skills may include scripts that the agent can run for deterministic steps, without needing to load the entire script (or large inputs) into the context window.

Skill folder structure

  • A skill is just a directory (e.g., changelog-newsletter/).
  • Required: SKILL.md at the root.
    • Starts with YAML frontmatter, e.g.:

      name: Changelog to Newsletter
      description: Convert git changelogs into a reader-friendly newsletter

    • Followed by Markdown instructions, checklists, examples, and links to other files.
  • Optional:
    • templates.md, examples.md, reference.md
    • scripts/ with *.py helpers for parsing/validation/formatting
    • resources/ with sample inputs/outputs

Progressive disclosure levels (useful interview phrasing)

  • Level 1: metadata only (name + description) — always present.
  • Level 2: full SKILL.md — loaded when the skill is triggered.
  • Level 3+: linked files/resources/scripts — loaded/executed only when relevant.

Skills vs prompting vs tools (interview answers)

  • Are Skills “prompt tech”?
    • Skills use prompting techniques (instructions, few-shot examples, templates), but they’re not just prompts: they add packaging, persistence, and on-demand loading.
  • Skills vs prompting:
    • Prompts are typically one-off or session-scoped; skills are versionable artifacts you can share across users/sessions and keep token-efficient via progressive disclosure.
  • Skills vs tool calls:
    • Tools provide the capability/action (“do X / fetch Y”); skills provide the playbook (“how to do X reliably”) and can teach consistent orchestration (plus optional scripts).

Skills + MCP (how they fit together)

  • MCP provides standardized access to tools/data sources (“connectivity layer”).
  • Skills provide the organization’s workflow, heuristics, and output formats (“operating procedure”).
  • Combined pattern: MCP exposes news.search/news.fetch, and a “News Briefing” skill defines query strategies, citation rules, budgets/stop reasons, and validation steps.

Best practices

  • Keep skills narrow and composable; split big domains into multiple skills.
  • Write descriptions so triggering is reliable: when to use + what output to produce.
  • Prefer canonical examples over exhaustive edge-case lists.
  • Start with evaluation: observe where the agent fails on representative tasks, then build skills incrementally to patch those gaps.
  • Audit any scripts/resources (treat as production code): least privilege, safe defaults, and clear inputs/outputs.

Security considerations

  • Install skills only from trusted sources; audit instructions, scripts, and bundled resources.
  • Watch for instructions or dependencies that could enable unsafe actions or data exfiltration.

RAG (Retrieval-Augmented Generation)

RAG retrieves relevant context from a corpus and then generates an answer conditioned on that retrieved context.

Baseline pipeline (a minimal end-to-end sketch follows the list)

  1. Ingest: parse docs, extract text, attach metadata.
  2. Chunk: split into retrieval units (preserve structure and headings).
  3. Embed: generate vectors for chunks (dense, sparse, or both).
  4. Index: store vectors + metadata.
  5. Retrieve: top-k candidates for a query.
  6. Rerank (optional): improve ordering using a stronger model.
  7. Generate: answer with citations and constraints.
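
A minimal end-to-end sketch of the pipeline using dense retrieval only. It assumes an `embed(texts) -> vectors` function from whatever embedding API you use; chunking, reranking, and citations are stripped to the skeleton.

# rag_min.py — ingest → chunk → embed → retrieve (sketch).
import numpy as np

def chunk(text: str, max_chars: int = 800) -> list[str]:
    """Naive paragraph-aligned chunking; real systems split on headings/structure."""
    parts, buf = [], ""
    for para in text.split("\n\n"):
        if len(buf) + len(para) > max_chars and buf:
            parts.append(buf.strip())
            buf = ""
        buf += para + "\n\n"
    if buf.strip():
        parts.append(buf.strip())
    return parts

def build_index(embed, docs: dict[str, str]) -> dict:
    chunks, meta = [], []
    for doc_id, text in docs.items():
        for i, c in enumerate(chunk(text)):
            chunks.append(c)
            meta.append({"doc_id": doc_id, "chunk_id": i})
    vecs = np.asarray(embed(chunks), dtype=np.float32)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)       # normalize for cosine similarity
    return {"vectors": vecs, "chunks": chunks, "meta": meta}

def retrieve(embed, index: dict, query: str, k: int = 5):
    q = np.asarray(embed([query]), dtype=np.float32)[0]
    q /= np.linalg.norm(q)
    scores = index["vectors"] @ q
    top = np.argsort(-scores)[:k]
    return [(index["chunks"][i], index["meta"][i], float(scores[i])) for i in top]

# Generation step: pass the top-k chunks as quoted, numbered sources and require citations.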

Chunking (the lever people underestimate)

  • Too small: loses context; retrieval gets noisy.
  • Too big: irrelevant text dilutes signal and increases tokens.
  • Align splits to structure (headings, paragraphs, code blocks).
  • Preserve metadata: title, section, URL/path, timestamp, permissions.

Retrieval

  • Dense (embeddings): semantic similarity.
  • Sparse (BM25): keyword matching.
  • Hybrid: combine dense + sparse for robustness (a fusion sketch follows the BM25 note).
  • Use metadata filters (product, date, language, access scope).

Note: Definition (BM25)

BM25 is a classic keyword-based ranking function used in sparse retrieval; it’s strong when exact terms matter (IDs, error codes, product names) and complements embeddings.
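
One low-risk way to combine sparse and dense results is reciprocal rank fusion (RRF): merge the two ranked lists by rank rather than raw score, which avoids calibrating BM25 scores against cosine similarities. A sketch:

# hybrid_rrf.py — reciprocal rank fusion of sparse + dense rankings (sketch).
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """rankings: each element is a list of doc IDs, best first.
    score(d) = sum over lists of 1 / (k + rank_of_d); unranked docs contribute 0."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fused = rrf([bm25_ranked_ids, dense_ranked_ids])[:top_k]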

Reranking

  • Cross-encoder rerankers are strong but add latency.
  • Heuristic reranking helps (freshness, authority, doc type).

Note: Definition (cross-encoder reranker)

A cross-encoder scores a (query, document) pair together (full attention across both), which usually improves ranking quality vs embedding similarity, but increases latency/cost.
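
A sketch of cross-encoder reranking, assuming the sentence-transformers library and one of its public MS MARCO cross-encoder checkpoints; swap in whatever reranker your stack uses.

# rerank.py — cross-encoder reranking of retrieved candidates (sketch).
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # public checkpoint

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    scores = reranker.predict([(query, c) for c in candidates])   # joint (query, doc) scoring
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_k]]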

Generation and citations

  • Provide retrieved passages as quoted context; require citations (prompt sketch after this list).
  • Enforce constraints: “If not in sources, say you don’t know.”
  • Consider a two-step: draft → verify against citations.
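
A sketch of assembling the generation prompt with numbered, quoted sources and an explicit abstention rule; the JSON output contract is illustrative.

# cited_answer.py — build a citation-constrained generation prompt (sketch).
def build_prompt(question: str, passages: list[dict]) -> str:
    """passages: [{"id": 1, "url": ..., "text": ...}, ...] from the retriever."""
    sources = "\n\n".join(f"[{p['id']}] ({p['url']})\n\"{p['text']}\"" for p in passages)
    return (
        "Answer using ONLY the sources below. Cite sources as [id] after each claim.\n"
        "If the answer is not in the sources, reply exactly: "
        "\"I don't know based on the provided sources.\"\n"
        "Return JSON: {\"answer\": str, \"citations\": [int]}.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )

# Two-step option: generate a draft, then a second call that checks each claim against
# its cited source and flags unsupported ones.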

Common failure modes

  • good retrieval, bad generation (model ignores context),
  • bad retrieval (embedding mismatch, chunking issues, no hybrid),
  • prompt injection from retrieved docs (treat as data),
  • stale indexes (no incremental ingestion or TTL).

Evals and operations

  • Retrieval: recall@k, MRR, nDCG.
  • Answers: faithfulness, completeness, helpfulness (human eval for ambiguity).
  • Observability: log doc IDs, scores, chunks shown, rerank results.
  • Cost controls: context budgets, caching embeddings/retrieval, summarization.

Note: Definitions (recall@k / MRR / nDCG)

recall@k: whether at least one relevant chunk/doc appears in the top k retrieved results.

MRR (Mean Reciprocal Rank): emphasizes ranking the first relevant result near the top (higher when the first relevant item appears earlier).

nDCG (normalized Discounted Cumulative Gain): measures ranking quality when relevance can be graded (very relevant vs somewhat relevant), rewarding good ordering.
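
The definitions above reduce to a few lines of code. A sketch over ranked lists of doc IDs, with graded relevance for nDCG; the label formats (sets of relevant IDs, per-doc gain dicts) are assumptions about how your eval set is stored.

# retrieval_metrics.py — recall@k, MRR, nDCG over ranked doc IDs (sketch).
import math

def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """1.0 if at least one relevant doc appears in the top k, else 0.0."""
    return 1.0 if any(d in relevant for d in ranked[:k]) else 0.0

def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0            # average this over queries to get MRR

def ndcg_at_k(ranked: list[str], gains: dict[str, float], k: int) -> float:
    """gains: graded relevance per doc ID (e.g., 2 = very relevant, 1 = somewhat)."""
    dcg = sum(gains.get(d, 0.0) / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0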

Enhancements (practical)

  • query rewriting,
  • multi-hop retrieval,
  • self-ask (subquestions per hop),
  • structured retrieval (tables/JSON extraction before generation).
