7. Applications: Agents and RAG

Overview

This chapter collects interview-prep study notes for building agentic applications.

1.1 What an agentic application is (system > model)

An agentic application is a system that can repeatedly:

  • observe the environment,
  • decide what to do next,
  • take actions (often via tools),
  • verify outcomes,
  • update its state,

until it reaches a goal or stops.

Useful interview distinction:

  • Workflow: LLM + tools orchestrated through predefined code paths (predictable, testable).
  • Agent: LLM dynamically directs its own steps and tool usage (flexible, higher variance).

Note: Decision rule (workflow vs agent)

Prefer a workflow when you can write down the steps and test them.

Use an agent when:

  • the sequence of steps is not known in advance,
  • the task requires iterative exploration (search → refine → fetch → verify),
  • you can enforce budgets and measure success (and fail closed).

Agents fit workflows that resist deterministic automation:

  • Complex decision-making: exceptions, judgment, context-sensitive decisions.
  • Brittle rules: rule engines that are expensive to maintain.
  • Unstructured data: email threads, docs, tickets, PDFs, free-form user text.

If a deterministic pipeline or a single LLM call + retrieval can meet the target, prefer that.

When not to use an agent (common trap):

  • when a deterministic pipeline is good enough (lower variance),
  • when the cost/latency budget is tight and you can’t bound tool usage,
  • when safety/compliance requires strictly deterministic behavior,
  • when you can’t evaluate it reliably (no test harness, no golden tasks).

Common agent products (examples)

  • Enterprise search across systems: permissions-aware retrieval + citation-first answers.
  • Support triage: routing + tool use + human handoff on edge cases.
  • Deep research briefs: planner–executor + multi-hop retrieval + synthesis.
  • Coding/debugging agent: loop with repo search + tests as verification.
  • Ops automations (refunds/cancellations): write tools with idempotency + confirmations.

1.2 Mental model: Sense → Think → Act → Check → Update State → Repeat

This is the minimal controller loop you should have in your head.

flowchart LR
    S["Sense<br/>(observe state + environment)"] --> T["Think<br/>(policy + prompting)"]
    T --> A["Act<br/>(tool calls)"]
    A --> C["Check<br/>(verifiers/validators)"]
    C --> U["Update state<br/>(store facts, citations, decisions)"]
    U --> S

1.3 Where RAG / tools / skills / prompting fit in the loop

Prompting and RAG are building blocks; the agent is the system that uses them.

You can think of an agent as a controller loop wrapped around three things: contracts (prompting + validators), context (RAG), and actions (tools).

flowchart TB
    U[User goal] --> A[Agent: controller loop]
    A --> P["Prompting / contracts<br/>(output spec, validators, retries)"]
    A --> R["RAG / context<br/>(retrieve, rerank, cite)"]
    A --> T["Tools / actions<br/>(APIs, DBs, side effects)"]
    A --> K["Skills / procedures<br/>(playbooks loaded on demand)"]
    P --> A
    R --> A
    T --> A
    K --> A

Analogy: think of an agent as a new hire operating a computer—tools are what it can touch (APIs/actions), and skills are the onboarding playbooks it can load on demand (checklists, templates, scripts). You build capable agents by giving them a small set of reliable tools, a library of skills, and a controller loop that decides what to load/use next.

What you’ll learn:

  • How to decide workflow vs agent, and when autonomy is worth the variance.
  • How to design tools (and optionally MCP) so agents are reliable in production.
  • How to think about guardrails, evaluation, and observability as the real sources of reliability.
  • A practical RAG pipeline and how to evaluate retrieval (recall@k, MRR, nDCG).
  • Prompting patterns that scale: output contracts, validation, and failure-mode thinking.

Tip: If you only remember 5 things

  • Start with a workflow; add agent autonomy only for long-tail decisions.
  • Treat tools as contracts (schemas, timeouts, idempotency, error semantics).
  • Put budgets on the loop (max tool calls / time / tokens) and always return a stop_reason.
  • Separate instructions from retrieved content; treat retrieved text as untrusted input.
  • Reliability comes from eval + observability, not clever prompts.

Core building blocks (agent architecture)

This section breaks an agent into components you can reason about, implement, and test.

2.1 Policy (the “brain”)

Policy is the combination of:

  • the model(s) you choose,
  • the routing logic (which model to call when),
  • and the high-level decision behavior (budgets, preferences, when to call tools, when to ask clarifying questions).

Model choice + routing (fast vs smart)

Common pattern:

  • Fast model for routing, extraction, and simple tool-use.
  • Smart model for hard reasoning, planning, and synthesis.

Routing can be rules-based (cheap, predictable) or learned/LLM-based (more flexible). Either way, make the routing observable and testable.
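
To make the fast-vs-smart split concrete, here is a minimal rules-based router sketch; the model names, thresholds, and task fields are illustrative assumptions, not a specific library's API:

FAST_MODEL = "fast-model"    # assumed: cheap model for routing, extraction, simple tool use
SMART_MODEL = "smart-model"  # assumed: stronger model for planning and synthesis

def route(task: dict) -> str:
    """Pick a model from cheap, observable rules; log the decision for testability."""
    needs_smart = (
        task.get("requires_planning", False)
        or task.get("num_documents", 0) > 5
        or len(task.get("input_text", "")) > 8_000
    )
    choice = SMART_MODEL if needs_smart else FAST_MODEL
    print({"event": "route_decision", "choice": choice, "task_id": task.get("id")})
    return choice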

Decoding settings (when it matters)

For systems work, decoding is mostly about managing variance:

  • Lower temperature for tool calls, structured outputs, and verification.
  • Higher temperature can help brainstorming or generating alternatives, but should be bounded by validators.

2.2 Prompting (policy parameterization at runtime)

Prompting that scales means outputs are machine-checkable. Common patterns:

Instruction structure + hierarchy

  • Separate system policy (non-negotiable) from task instructions.
  • Keep tool schemas and safety rules pinned.
  • Put retrieved text in a clearly labeled “sources” block.

Output contracts (schemas), stop_reason

  • Output contracts: JSON schema / typed model output with required fields.
  • Validators: schema validation + invariant checks (e.g., “must include citations”).
  • Retry with feedback: on failure, retry with the validator error message.
  • Stop reasons: always include stop_reason so monitoring is actionable.

Example validator loop:

def generate_with_validation(prompt, context, max_attempts=3):
    """Generate, validate, and retry with the validator's error message."""
    for attempt in range(max_attempts):
        out = llm(prompt, context)          # model call (assumed helper)
        ok, err = validate(out)             # schema + invariant checks
        if ok:
            return out
        prompt = prompt + f"\n\nFix validation error: {err}"
    raise RuntimeError("validation_failed")

Note: Stop reason taxonomy (useful for logging)

Common stop reasons:

  • success
  • tool_budget_reached
  • time_budget_reached
  • tool_error
  • sources_conflict
  • needs_human_review
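
One lightweight way to keep the taxonomy machine-checkable is to pin it as a closed type; a small sketch in plain Python, with no framework assumed:

from typing import Literal, get_args

# Stop reasons as a closed type, so validators and monitoring can reject unknown values.
StopReason = Literal[
    "success",
    "tool_budget_reached",
    "time_budget_reached",
    "tool_error",
    "sources_conflict",
    "needs_human_review",
]

def is_valid_stop_reason(value: str) -> bool:
    """True iff value is one of the pinned stop reasons."""
    return value in get_args(StopReason)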

Tool-use prompting (when to call tools vs answer)

Typical rule: if correctness depends on external state (DBs, tickets, code, current policy), call a tool rather than guessing.

Repair prompts (retry with validator feedback)

The most production-grade prompting pattern is: generate → validate → repair with specific feedback → re-validate.

Retrieval-aware prompting (citations; treat retrieved text as untrusted)

  • Require citations for factual claims.
  • Treat retrieved text as untrusted input (prompt injection risk).
  • Separate instructions from sources; sources should not override policy.

2.3 State (working memory)

In production, treat state as an explicit object (goals, constraints, intermediate results) rather than “whatever is in the chat history.” This makes the loop testable and makes compaction/memory straightforward.

What belongs in state vs context

  • State: structured, distilled facts/decisions the agent relies on (small, stable).
  • Context: the token payload you pass to the model this turn (instructions + latest observations + selected sources).

Short-term task state vs long-term memory (provenance + TTL)

  • Short-term: current task scratchpad (queries tried, constraints, pending TODOs).
  • Long-term: durable facts/preferences with provenance and TTL.

Storing distilled tool/RAG results (facts, citations, decisions)

Prefer storing:

  • extracted facts + citations,
  • decisions + rationale,
  • and pointers/IDs to raw payloads,

instead of repeatedly re-including raw tool outputs.
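
A minimal sketch of an explicit state object along these lines; the field names are illustrative, not a specific framework's schema:

from dataclasses import dataclass, field

@dataclass
class Fact:
    claim: str
    citations: list[str]          # URLs or doc IDs supporting the claim

@dataclass
class AgentState:
    goal: str
    constraints: list[str] = field(default_factory=list)
    facts: list[Fact] = field(default_factory=list)           # distilled, cited facts
    decisions: list[str] = field(default_factory=list)        # decision + rationale
    raw_result_ids: list[str] = field(default_factory=list)   # pointers to raw tool payloads
    stop_reason: str | None = None

# Usage sketch: store the distilled fact plus a pointer, not the raw tool output.
state = AgentState(goal="summarize outage impact")
state.facts.append(Fact(claim="Outage lasted 42 minutes", citations=["doc://incident-123"]))
state.raw_result_ids.append("toolcall-0007")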

2.4 Planner (optional)

Planning is how the system selects subgoals and decides when to stop.

  • Decomposition: break a goal into subgoals and track completion.
  • Termination criteria: define “done” and “not worth it” conditions.

You can implement planning as:

  • Explicit planner (separate step or separate model), or
  • Implicit planning inside the policy prompt.

Explicit planning is easier to observe and debug, but adds latency.

Orchestration patterns (single + multi-agent)

Start simple and add structure only when it improves measurable outcomes.

Workflows (predefined orchestration paths):

  • Prompt chaining: step-by-step decomposition + gates/checks.
  • Routing: classify input and send to specialized prompts/models.
  • Parallelization: sectioning (subtasks) or voting (multiple attempts).
  • Orchestrator–workers: manager decomposes dynamically, workers execute, manager synthesizes.
  • Evaluator–optimizer: generate → critique → refine against explicit criteria.

Agentic systems:

  • Single-agent loop: one agent + tools + instructions.
  • Manager (agents-as-tools): one agent owns the user interaction and delegates.
  • Decentralized handoffs: agents transfer control peer-to-peer (good for triage).

2.5 Tool interface (actions)

Treat tools as an agent-computer interface (ACI).

Tooling checklist:

  • Schemas: explicit parameters and return types; validate inputs/outputs.
  • Naming: make tool names and arguments obvious; avoid overlapping tools.
  • Timeouts/retries: retries with backoff; classify transient vs permanent errors.
  • Idempotency: for write actions, add request IDs and make retries safe.
  • Observability: log tool name, args, latency, errors, and results.

Tool contracts: schema, validation, timeouts, retries, idempotency

Note: Definitions (tool reliability)

MCP (Model Context Protocol): a standardized way to expose tools/resources to an agent with consistent schemas; interview intuition: “a tool integration layer,” not a safety boundary.

Idempotency: make write actions safe to retry (same request ID → same effect, no duplicate refunds/emails).

Allowlist: an explicit list of permitted tools/actions enforced by your application (never delegated to the model).

Permissions + allowlists (enforced outside the model)

  • Enforce permissions in your runtime (authz, scopes, tenancy), not in prompts.
  • Risk-rate tools (read-only vs write; reversible vs irreversible; $ impact).

Tool contract template (practical)

A good default is to define (and validate) for every tool:

  • Name + intent
  • Input schema (types, required fields, constraints)
  • Output schema (types, required fields)
  • Side effects (read-only vs write; reversible vs irreversible)
  • Error semantics (retryable vs permanent)
  • Timeout + size bounds
  • Idempotency key (required for write tools)
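
A hedged sketch of such a contract in code, using a plain dataclass for validation (a real system might use Pydantic or JSON Schema; the tool name, limits, and error codes are assumptions):

import uuid
from dataclasses import dataclass

@dataclass(frozen=True)
class RefundRequest:
    order_id: str
    amount_cents: int
    idempotency_key: str          # required for write tools: same key -> same effect

    def validate(self) -> None:
        if not self.order_id:
            raise ValueError("order_id is required")
        if not (0 < self.amount_cents <= 50_000):
            raise ValueError("amount_cents must be in (0, 50000]")
        if not self.idempotency_key:
            raise ValueError("idempotency_key is required")

TOOL_SPEC = {
    "name": "billing.refund",
    "side_effects": "write_high",                       # risk tier drives extra guardrails
    "timeout_s": 10,
    "retryable_errors": ["timeout", "rate_limited"],    # everything else is permanent
}

req = RefundRequest(order_id="ord_123", amount_cents=1500,
                    idempotency_key=str(uuid.uuid4()))
req.validate()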

A lightweight risk-tier system helps decide what guardrails to apply:

  • read_low (search, fetch, lookup): timeouts, size caps
  • write_medium (create ticket, draft email): idempotency, dry-run
  • write_high (refunds, cancellations, prod changes): confirmation, allowlist

2.6 Retriever / knowledge access (RAG)

RAG as context (retrieve, rerank, cite)

RAG (Retrieval-Augmented Generation) retrieves relevant context from a corpus and then generates an answer conditioned on that retrieved context. In agentic systems, retrieval is often iterative: retrieve → read → decide → retrieve again.

Baseline pipeline

  1. Ingest: parse docs, extract text, attach metadata.
  2. Chunk: split into retrieval units (preserve structure and headings).
  3. Embed: generate vectors for chunks (dense, sparse, or both).
  4. Index: store vectors + metadata.
  5. Retrieve: top-k candidates for a query.
  6. Rerank (optional): improve ordering using a stronger model.
  7. Generate: answer with citations and constraints.
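
A compressed sketch of that pipeline; embed() and generate() are stand-ins for real model calls, and the chunker and index are deliberately naive:

def embed(text: str) -> list[float]:        # assumed: embedding model call
    raise NotImplementedError

def generate(prompt: str) -> str:           # assumed: LLM call
    raise NotImplementedError

def chunk(doc: dict, max_chars: int = 800) -> list[dict]:
    """Split on paragraph boundaries, keeping metadata attached to every chunk."""
    chunks, buf = [], ""
    for para in doc["text"].split("\n\n"):
        if buf and len(buf) + len(para) > max_chars:
            chunks.append({"text": buf.strip(), "meta": doc["meta"]})
            buf = ""
        buf += para + "\n\n"
    if buf.strip():
        chunks.append({"text": buf.strip(), "meta": doc["meta"]})
    return chunks

def build_index(docs: list[dict]) -> list[dict]:
    """Embed every chunk and keep its metadata alongside the vector."""
    return [{**c, "vec": embed(c["text"])} for d in docs for c in chunk(d)]

def retrieve(query: str, index: list[dict], k: int = 5) -> list[dict]:
    qv = embed(query)
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return sorted(index, key=lambda c: -dot(qv, c["vec"]))[:k]

def answer(query: str, index: list[dict]) -> str:
    sources = retrieve(query, index)
    context = "\n\n".join(f"[{i}] {c['text']}" for i, c in enumerate(sources))
    return generate(f"Answer with citations like [0]. If not in sources, say so.\n"
                    f"SOURCES:\n{context}\n\nQUESTION: {query}")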

Chunking (the lever people underestimate)

  • Too small: loses context; retrieval gets noisy.
  • Too big: irrelevant text dilutes signal and increases tokens.
  • Align splits to structure (headings, paragraphs, code blocks).
  • Preserve metadata: title, section, URL/path, timestamp, permissions.

Retrieval

  • Dense (embeddings): semantic similarity.
  • Sparse (BM25): keyword matching.
  • Hybrid: combine dense + sparse for robustness.
  • Use metadata filters (product, date, language, access scope).

Note: Definition of BM25

BM25 is a classic keyword-based ranking function used in sparse retrieval; it’s strong when exact terms matter (IDs, error codes, product names) and complements embeddings.
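
A common, simple way to combine dense and sparse results is reciprocal rank fusion (RRF); a minimal sketch, assuming each ranking is a list of doc IDs (k=60 is the conventional default):

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked lists of doc IDs; each list contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fuse a BM25 ranking with an embedding ranking.
dense = ["doc_a", "doc_c", "doc_b"]
sparse = ["doc_c", "doc_a", "doc_d"]
print(reciprocal_rank_fusion([dense, sparse]))  # doc_a and doc_c rise to the top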

Reranking

  • Cross-encoder rerankers are strong but add latency.
  • Heuristic reranking helps (freshness, authority, doc type).

Note: Definition of cross-encoder reranker

A cross-encoder scores a (query, document) pair together (full attention across both), which usually improves ranking quality vs embedding similarity, but increases latency/cost.

Generation and citations

  • Provide retrieved passages as quoted context; require citations.
  • Enforce constraints: “If not in sources, say you don’t know.”
  • Consider a two-step: draft → verify against citations.

Common failure modes

  • good retrieval, bad generation (model ignores context),
  • bad retrieval (embedding mismatch, chunking issues, no hybrid),
  • prompt injection from retrieved docs (treat as data),
  • stale indexes (no incremental ingestion or TTL).

Evals and operations

  • Retrieval: recall@k, MRR, nDCG.
  • Answers: faithfulness, completeness, helpfulness (human eval for ambiguity).
  • Observability: log doc IDs, scores, chunks shown, rerank results.
  • Cost controls: context budgets, caching embeddings/retrieval, summarization.

Note: Definitions of recall@k / MRR / nDCG

recall@k: whether at least one relevant chunk/doc appears in the top k retrieved results.

MRR (Mean Reciprocal Rank): emphasizes ranking the first relevant result near the top (higher when the first relevant item appears earlier).

nDCG (normalized Discounted Cumulative Gain): measures ranking quality when relevance can be graded (very relevant vs somewhat relevant), rewarding good ordering.
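
These are straightforward to compute; a minimal sketch over ranked doc IDs, with graded relevance for nDCG passed as a gains dict (per-query values you would then average across the eval set):

import math

def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """1.0 if at least one relevant doc appears in the top k, else 0.0 (per query)."""
    return 1.0 if any(d in relevant for d in ranked[:k]) else 0.0

def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """MRR is the mean of this value over queries."""
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked: list[str], gains: dict[str, float], k: int) -> float:
    """Graded relevance via gains (e.g., 2.0 = very relevant, 1.0 = somewhat relevant)."""
    dcg = sum(gains.get(d, 0.0) / math.log2(i + 1) for i, d in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0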

Enhancements (practical)

  • query rewriting,
  • multi-hop retrieval,
  • self-ask (subquestions per hop),
  • structured retrieval (tables/JSON extraction before generation).

2.7 Environment

The environment is where the agent operates.

  • Observations: what the agent can perceive (tool reads, user input, logs, retrieved docs).
  • Actions: what it can do (tool writes, messages, tickets, deploys).
  • Success criteria: the definition of “done” and acceptable failure modes.

In interviews, explicitly define environment boundaries (what the agent can and cannot access) and what “ground truth” looks like.

2.8 Critic / verifier (checks inside the loop)

Verification is where most reliability comes from.

  • Schema + invariant checks: validate structured outputs and required fields.
  • Rules/unit tests: cheap deterministic checks for obvious mistakes.
  • Citation/grounding verification: ensure claims are supported by sources.
  • LLM-as-judge: useful for fuzzy checks (style, completeness), but treat it as a probabilistic component.
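
A minimal citation-coverage check along these lines (the claim/citation field names are assumptions about the output contract, not a fixed schema):

def check_grounding(claims: list[dict], allowed_sources: set[str]) -> tuple[bool, str]:
    """Cheap deterministic check: every claim must cite at least one retrieved source."""
    for i, claim in enumerate(claims):
        cites = [c for c in claim.get("citations", []) if c in allowed_sources]
        if not cites:
            return False, f"claim {i} has no citation from the retrieved sources"
    return True, ""

ok, err = check_grounding(
    claims=[{"text": "Latency dropped 30%", "citations": ["doc://report-7"]}],
    allowed_sources={"doc://report-7", "doc://report-9"},
)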

2.9 Control loop & termination

Minimal agent loop (mental model):

while not done(state):
    decision = llm(state, context)
    if decision.tool_call:
        result = tool(decision.args)
        state = update(state, result)
    else:
        return decision.response

Typical stop conditions:

  • model emits a final structured output,
  • model responds with no tool calls,
  • a tool errors or times out,
  • max turns or max tool calls.

Stop vs clarify vs escalate

  • Stop when success criteria are met (or marginal gain is low).
  • Clarify when user constraints are missing or conflicting.
  • Escalate (human-in-the-loop) for high-risk actions, conflicting sources, repeated tool failures, or low confidence.

Budgets (tokens/time/cost/tool calls) + safety cutoffs

  • Max turns, max tool calls, max wall-clock.
  • Safety cutoffs for high-risk tools (require confirmation).
  • Always return and log a stop_reason.

Retry policy (production default)

  • Classify tool errors (retryable vs permanent).
  • Retry only when idempotent and safe.
  • Bound retries and surface a clear stop_reason when exhausted.
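
A sketch of that default; the error codes, backoff schedule, and ToolError shape are illustrative:

import time

RETRYABLE = {"timeout", "rate_limited", "server_unavailable"}   # assumed error codes

class ToolError(Exception):
    def __init__(self, code: str):
        super().__init__(code)
        self.code = code

def call_with_retry(call, *, idempotent: bool, max_retries: int = 3):
    """Retry only retryable errors on idempotent calls, with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except ToolError as e:
            permanent = e.code not in RETRYABLE or not idempotent
            if permanent or attempt == max_retries:
                return {"error": e.code, "stop_reason": "tool_error"}
            time.sleep(2 ** attempt)      # 1s, 2s, 4s, ...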

2.10 Skills (packaged procedures)

Claude Skills (Agent Skills) are reusable, composable packages of instructions + scripts + resources (files/folders) that equip an agent with domain-specific procedural knowledge—like giving a new hire an internal playbook.

Mental model

  • Skills = “procedural expertise” you install once instead of re-pasting long prompts.
  • Progressive disclosure keeps context lean: only lightweight metadata is always present; details load on-demand.
  • Skills can include deterministic scripts for steps where token-by-token generation is slow or unreliable.

How Skills work (high level)

  1. Startup: the agent pre-loads each skill’s name + description (from YAML frontmatter) into the system prompt.
  2. Relevance check: on a user task, the model decides whether to trigger a skill (similar to tool selection).
  3. On-demand load: when triggered, the agent reads the full SKILL.md into context.
  4. Deeper load: SKILL.md can reference additional files (e.g., forms.md, reference.md); the model reads them only if needed.
  5. Code execution (optional): skills may include scripts that the agent can run for deterministic steps, without needing to load the entire script (or large inputs) into the context window.

Skill folder structure

  • A skill is just a directory (e.g., changelog-newsletter/).
  • Required: SKILL.md at the root.
    • Starts with YAML frontmatter, for example:

      name: Changelog to Newsletter
      description: Convert git changelogs into a reader-friendly newsletter

    • Followed by Markdown instructions, checklists, examples, and links to other files.
  • Optional:
    • templates.md, examples.md, reference.md
    • scripts/ with *.py helpers for parsing/validation/formatting
    • resources/ with sample inputs/outputs

Progressive disclosure levels (useful interview phrasing)

  • Level 1: metadata only (name + description) — always present.
  • Level 2: full SKILL.md — loaded when the skill is triggered.
  • Level 3+: linked files/resources/scripts — loaded/executed only when relevant.

Skills vs prompting vs tools (interview answers)

  • Are Skills “prompt tech”?
    • Skills use prompting techniques (instructions, few-shot examples, templates), but they’re not just prompts: they add packaging, persistence, and on-demand loading.
  • Skills vs prompting:
    • Prompts are typically one-off or session-scoped; skills are versionable artifacts you can share across users/sessions and keep token-efficient via progressive disclosure.
  • Skills vs tool calls:
    • Tools provide the capability/action (“do X / fetch Y”); skills provide the playbook (“how to do X reliably”) and can teach consistent orchestration (plus optional scripts).

Skills + MCP (how they fit together)

  • MCP provides standardized access to tools/data sources (“connectivity layer”).
  • Skills provide the organization’s workflow, heuristics, and output formats (“operating procedure”).
  • Combined pattern: MCP exposes news.search/news.fetch, and a “News Briefing” skill defines query strategies, citation rules, budgets/stop reasons, and validation steps.

Best practices

  • Keep skills narrow and composable; split big domains into multiple skills.
  • Write descriptions so triggering is reliable: when to use + what output to produce.
  • Prefer canonical examples over exhaustive edge-case lists.
  • Start with evaluation: observe where the agent fails on representative tasks, then build skills incrementally to patch those gaps.
  • Audit any scripts/resources (treat as production code): least privilege, safe defaults, and clear inputs/outputs.

Security considerations

  • Install skills only from trusted sources; audit instructions, scripts, and bundled resources.
  • Watch for instructions or dependencies that could enable unsafe actions or data exfiltration.

Reliability & safety (production readiness)

This is where agent projects usually succeed or fail: not in clever prompts, but in the production system around them.

3.1 Guardrails

Guardrails should combine multiple layers:

  • Relevance: keep the agent on-scope.
  • Safety/jailbreak/injection detection.
  • PII filtering and redaction at boundaries.
  • Moderation for harmful content.
  • Rules-based checks: regex/blocklists, input limits, schema validation.
  • Tool safeguards: risk-rate tools (read-only vs write, reversibility, $ impact) and require extra checks or user confirmation for high-risk actions.

Human-in-the-loop triggers:

  • exceeding failure thresholds (retries, repeated tool errors),
  • high-risk actions (refunds, cancellations, prod changes).

3.2 Observability

Always log:

  • prompt version, tools + args, tool outputs, stop reason, and user-visible outcome.

Useful split:

  • Offline evals (before launch): replayable “golden tasks” + tool-call correctness.
  • Online monitoring (after launch): success rate, escalation rate, budget exhaustion, tool error/latency.

Debug funnel (fast triage):

  1. Tooling failure (timeout, bad args, permissions).
  2. Retrieval failure (wrong/no sources).
  3. Context construction (too much noise / missing key facts).
  4. Generation failure (ignored context / violated contracts).
  5. Guardrail failure (false blocks or missed blocks).

3.3 Memory management

Context = what you pass to the model now.

How to do effective context engineering for agents:

  • Prompt engineering → context engineering: beyond writing instructions, you’re curating the full token set each turn (instructions, tools, message history, retrieved data, tool outputs).
  • Context is a finite “attention budget”: longer contexts can degrade recall/precision (“context rot”); treat tokens as scarce resources with diminishing returns.
  • Guiding principle: aim for the smallest high-signal context that reliably produces the desired behavior.
  • System prompt altitude (“Goldilocks”): avoid brittle pseudo-code / if-else logic in prompts and avoid vague guidance; be clear, direct, and structured.
  • Structure helps: organize instructions into explicit sections (background, constraints, tool guidance, output spec).
  • Examples beat rule-lists: prefer a small set of diverse, canonical few-shot examples over long edge-case lists.
  • Tools are context shapers: tool contracts should be token-efficient and unambiguous.
  • Progressive disclosure: keep metadata always; load details only when needed.

Long-horizon techniques (when the task outgrows the context window):

  • Compaction: periodically summarize and restart with a condensed state.
  • Tool-result clearing: once a tool result is “digested,” avoid re-including raw payloads deep in history.
  • Structured note-taking (agentic memory): persist TODOs, decisions, constraints outside the chat.
  • Sub-agent architectures: delegate deep exploration to subagents with clean contexts; return distilled summaries.
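
A compaction sketch under these assumptions: llm is an assumed chat helper (same calling style as the earlier pseudocode), and the threshold and summary prompt are illustrative:

def maybe_compact(history: list[dict], state: dict, llm, max_messages: int = 40) -> list[dict]:
    """When history grows too long, summarize it into structured state and restart lean."""
    if len(history) <= max_messages:
        return history
    summary = llm(
        "Summarize this trace into: open TODOs, decisions made, key facts with citations.",
        context={"history": history},
    )
    state["compaction_summary"] = summary       # persisted outside the chat transcript
    # Keep only pinned policy plus the condensed summary; drop digested tool payloads.
    pinned = [m for m in history if m.get("pinned")]
    return pinned + [{"role": "system", "content": f"Summary of earlier work: {summary}"}]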

Context layers (handy breakdown):

  • Pinned context: system policy + tool schemas.
  • Working set: recent turns + latest tool results.
  • Retrieved context: RAG results pulled on-demand.
  • Summaries: compress long traces into structured state.

Common pitfalls:

  • mixing instructions with retrieved data (prompt injection risk),
  • dumping raw logs into context (token blow-up),
  • storing PII without consent.

3.4 Training / eval harness

Evaluate agents as systems:

  • Task success rate on realistic workflows.
  • Tool-call accuracy: correct tool, correct args, correct sequencing.
  • Time/cost: wall-clock, tokens, tool calls.
  • Safety: disallowed tool attempts, policy violations, PII leakage.

Build a replayable harness with:

  • a representative task set (“golden tasks”),
  • deterministic tool mocks (where possible),
  • and scorecards for success, cost/latency, and safety.
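
A minimal replayable-harness sketch along those lines; the runner signature, trace fields, and per-task check callables are assumptions, not a specific framework's API:

def run_harness(golden_tasks: list[dict], run_agent, tool_mocks: dict) -> dict:
    """Replay golden tasks against deterministic tool mocks and produce a scorecard."""
    results = []
    for task in golden_tasks:
        trace = run_agent(task["input"], tools=tool_mocks)       # assumed runner signature
        results.append({
            "task_id": task["id"],
            "success": task["check"](trace.final_output),        # per-task success check
            "tool_calls": len(trace.tool_calls),
            "violations": trace.safety_violations,
            "latency_s": trace.latency_s,
        })
    n = max(len(results), 1)
    return {
        "success_rate": sum(r["success"] for r in results) / n,
        "avg_tool_calls": sum(r["tool_calls"] for r in results) / n,
        "safety_violations": sum(len(r["violations"]) for r in results),
    }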

Interview drills

4.1 System design talk track template

One framework to answer most “Design an agent that does X” questions:

  • 1) Clarify X + constraints: what’s the input, output, latency/SLA, and what’s out of scope?
  • 2) Define success + risk metrics: task success rate, cost/latency, escalation rate, and domain-specific error rates (e.g., false approvals).
  • 3) Choose workflow vs agent: start with the simplest deterministic workflow that can hit the target; justify autonomy only for long-tail decisions.
  • 4) Specify tools + contracts: list read vs write tools, schemas, retries/timeouts, and idempotency for writes; enforce permissions/allowlists outside the model.
  • 5) Design state + context: what’s pinned (policy/tools), what’s retrieved (RAG), what’s in working memory, and what persists (notes/memory with provenance + TTL).
  • 6) Orchestration: single-agent loop first; add routing / planner–executor / evaluators / subagents only when it improves measured outcomes.
  • 7) Guardrails + human handoff: risk-rate tools/actions, require confirmation for high-blast-radius writes, and define clear escalation triggers.
  • 8) Evals + rollout: build replayable test sets, measure tool-call correctness + safety, ship in shadow/assist mode, then tighten budgets and monitoring.

4.2 Common Q&A

“For the toy news agent, what should the tool contracts look like?”

  • news.search: strict schema, bounded results, stable IDs/URLs, and clear error semantics (retryable vs not).
  • news.fetch: deterministic extraction, timeouts, size limits, and normalized fields (title/source/published_at/text).
  • Prefer returning summaries + pointers over huge payloads; keep token cost predictable.

“How do you prevent ‘search → fetch everything’ tool spam?”

  • Budgets: max tool calls, max sources, max tokens per tool result; stop when marginal gain falls.
  • Planner step must emit an action + rationale + stop criteria; log stop_reason.
  • Add lightweight heuristics: dedupe by URL/domain, prefer primary sources, diversify outlets.

“How do you evaluate whether the agent is actually doing good retrieval?”

  • Retrieval metrics: recall@k / MRR / nDCG on a labeled set of “topic → relevant articles”.
  • Behavioral metrics: source diversity, freshness, redundancy rate, and citation coverage (claims with ≥2 independent sources).
  • End-to-end: briefing faithfulness (citations support claims), contradiction rate, and abstention quality.

“How do you handle context growth over a 25-tool-call loop?”

  • Treat context as an attention budget: keep raw tool outputs out of the prompt once digested.
  • Maintain structured state (queries tried, sources metadata, extracted claims) and compact periodically.
  • Retrieve just-in-time via references (URLs/IDs) rather than carrying full documents.

“Where do agents like this fail in practice, technically?”

  • Ambiguous tools or overlapping capabilities → wrong tool choice.
  • Context pollution: too many low-signal snippets → missed key facts (“context rot”).
  • Citation drift: model states claims without anchoring to source spans.
  • Latency/cost blowups: unconstrained fetching, reranking, or repeated query rewrites.

“How do you make this safe and robust?”

  • Treat retrieved text as data: isolate it, forbid it from overriding instructions, and sanitize HTML.
  • Enforce allowlists + permissions outside the model; risk-rate tools even if they’re read-only.
  • Add validators: schema checks, citation-required checks, and retries with explicit error messages.
  • Human-in-the-loop triggers: conflicting sources, low confidence, repeated tool errors, or budget exhaustion.

Case study

5.1 Toy news agent

Goal: “Given a topic, autonomously explore the news landscape, iteratively refine queries, verify claims against sources, and produce a cited brief with an explicit stop reason.”

High-level architecture

  • Agent client (CLI/UI) collects user intent and displays results.
  • Agent runtime (e.g., fast_agent) runs the loop: decide → call tools → synthesize.
  • LLM provider (e.g., OpenRouter) supplies the model used for reasoning + summarization.
  • MCP server (e.g., fast_mcp) exposes tools like news.search and news.fetch.

Note: Definitions (MCP + client + model routing)

agent client: the user-facing surface (CLI, web UI, Slack bot) that sends tasks to an agent runtime and renders results. It should not embed tool logic; keep that in the runtime/tooling layer.

MCP (Model Context Protocol): a standard protocol for advertising tools/resources to an agent runtime with consistent contracts. Interview intuition: it’s an integration layer that lets one client talk to many tools safely and consistently.

OpenRouter: a hosted API that routes a single OpenAI-compatible request format to many different model providers. Interview intuition: it’s a convenient way to swap models (or use multiple models) without rewriting your client.

Why an LLM “needs” MCP (interview framing)

  • The model itself doesn’t speak to your databases or APIs; an agent runtime does.
  • MCP makes “tool discovery + invocation” consistent across tools and across clients.
  • MCP centralizes governance: tool allowlists, permissions, and budgets can be enforced outside the model.

5.2 Tool contracts + RAG strategy + budgets + stop reasons

How MCP acts as an adapter

  • Agent client ↔︎ agent runtime: UX and session management.
  • Agent runtime ↔︎ LLM: completion/chat API (OpenAI-style, OpenRouter, etc.).
  • Agent runtime ↔︎ tools: MCP is the adapter layer that standardizes tool contracts and lets you register many tools behind one interface.

Toy toolset (registered on the MCP server)

  • news.search(query, since_days, limit) -> [{title, url, source, published_at}]
  • news.fetch(url) -> {url, title, text}
  • news.summarize(text, max_bullets) -> {bullets} (optional; many teams do summarization in the LLM instead)

RAG strategy (simple default):

  • Search broadly with news.search (diverse sources).
  • Fetch selectively with news.fetch (bounded per-article size).
  • Extract claims with citations; prefer ≥2 independent sources for strong claims.

Budgets and stop reasons:

  • Max sources, max tool calls, max wall-clock; stop when marginal gain is low.
  • Return stop_reason like coverage_sufficient, tool_budget_reached, or sources_conflict.

Note: Definition of tool contract

A tool contract is the strict input/output interface an agent runtime relies on (names, types, required fields, error semantics). It’s what makes tool calls testable and safe.

Important features of tool contracts (what interviewers look for)

  • Deterministic outputs where possible (tools should not be “creative”).
  • Explicit types/shape: required vs optional fields; stable field names.
  • Clear failure semantics: structured errors, retryable vs non-retryable.
  • Timeouts and bounded work (tool must fail fast rather than hang).
  • Idempotency for writes (request IDs) and safe defaults.
  • Auth/permissions handled by the runtime, not the model.
  • Observability hooks: request IDs, tool name, latency, and minimal logging of sensitive fields.

Agentic loop (what you’d explain on a whiteboard)

  1. Client sends intent + constraints (topic, time window, preferred sources, max budget).
  2. Agent proposes a short research plan (subtopics, query variants, coverage targets).
  3. Agent runs iterative cycles:
    • call news.search with a query variant,
    • call news.fetch for the most promising sources,
    • extract candidate claims + attach citations,
    • assess coverage/novelty/credibility; decide what’s missing.
  4. Agent adapts: rewrite queries, broaden/narrow scope, or request clarification if constraints conflict.
  5. Stop when one condition is met: coverage threshold reached, marginal gains fall below a threshold, budget exhausted, or sources conflict (trigger “needs human review”).
  6. Return a structured brief (key findings + citations + confidence + stop reason).

5.3 Minimal pseudocode skeleton

Example skeleton (template, APIs vary by library)

MCP server (fast_mcp-style pseudocode):

# mcp_server.py

# Pseudocode: adjust imports/APIs to your fast_mcp version.

@tool(name="news.search")
def search(query: str, since_days: int = 7, limit: int = 5) -> list[dict]:
    """Return a list of candidate articles with URL + metadata."""
    ...


@tool(name="news.fetch")
def fetch(url: str) -> dict:
    """Fetch article text from a URL."""
    ...


def main():
    server = FastMCPServer(tools=[search, fetch])
    server.run()

Agent runtime (fast_agent-style pseudocode):

# agent_client.py

# Pseudocode: emphasize an adaptive loop (plan → act → observe → decide).

import os

mcp = MCPClient("http://localhost:8000")
llm = OpenRouterChat(model="openai/gpt-4o-mini", api_key=os.environ["OPENROUTER_API_KEY"])

goal = {
    "topic": "AI agents",
    "since_days": 7,
    "max_sources": 12,
    "max_tool_calls": 25,
}

state = {
    "queries_tried": [],
    "sources": [],  # {url, title, source, published_at, text}
    "claims": [],   # {claim, citations:[url], confidence}
    "stop_reason": None,
}

while True:
    if len(state["sources"]) >= goal["max_sources"]:
        state["stop_reason"] = "source_budget_reached"
        break
    if tool_calls_used() >= goal["max_tool_calls"]:
        state["stop_reason"] = "tool_budget_reached"
        break

    plan = llm(
        "Propose the next best action (search/fetch), "
        "a query variant if needed, and what coverage is missing.",
        context={"goal": goal, "state": state},
    )

    if plan.action == "search":
        hits = mcp.call("news.search", query=plan.query, since_days=goal["since_days"], limit=10)
        state["queries_tried"].append(plan.query)
        state = update_state_with_hits(state, hits)

    elif plan.action == "fetch":
        doc = mcp.call("news.fetch", url=plan.url)
        state = update_state_with_doc(state, doc)

    elif plan.action == "stop":
        state["stop_reason"] = plan.reason  # e.g., "coverage_sufficient" / "sources_conflict"
        break

    # Extract/verify claims against collected sources (LLM-as-judge + strict citation rules).
    state["claims"] = llm(
        "Extract candidate claims and attach citations; "
        "if a claim lacks support, mark it as uncertain.",
        context={"sources": state["sources"]},
    )

brief = llm(
    "Write a structured briefing with citations, confidence, and stop_reason.",
    context={"goal": goal, "claims": state["claims"], "stop_reason": state["stop_reason"]},
)
print(brief)

5.4 Framework landscape

Many other frameworks (including MCP-based stacks) support tool use in essentially the same way; this landscape is changing fast:

  • LangChain / LangGraph: large ecosystem, graph-based orchestration.
  • LlamaIndex: strong RAG + data connectors.
  • Semantic Kernel: enterprise-friendly agent + tool abstractions.
  • AutoGen / CrewAI: multi-agent coordination patterns.
  • Haystack: retrieval pipelines + production patterns.
  • DSPy / PydanticAI: programmatic prompting / typed structured outputs.
  • MCP ecosystem: tool servers + client runtimes that standardize tool wiring.

References