Enterprise AI Agent Reference Architecture

The conversational chatbot was 2023. The retrieval-augmented assistant was 2024. The production agent that takes actions, holds state across tasks, and integrates with the rest of your stack is 2025-2026. The architecture under those agents looks different from a RAG pipeline + an LLM. Without that difference, agents fail interestingly in dev + then catastrophically in prod.

This is the reference architecture we apply on enterprise AI agent engagements. It covers tool design, Model Context Protocol (MCP) integration, sandbox isolation, the eval suite that makes agents production-grade, and the cost + safety discipline that keeps the programme defensible.

What "agent" actually means in 2026

The word has been diluted. For this piece:

An AI agent is an LLM-driven system that takes a task, decides on a sequence of tool calls, executes them, observes results, and continues until the task is complete or it determines it cannot complete it.

The difference from a chatbot: chatbots answer questions; agents act on the world. The difference from a RAG pipeline: RAG retrieves + answers; agents retrieve + decide + act.

Three agent patterns in production

Pattern	Use case	Risk surface
Read-only research agent	Summarise documents, answer questions from internal data, prepare briefings	Low. Cannot mutate state.
Write-back agent	Update CRM records, send templated emails, post to ticketing systems	Medium. Mutations are reversible but inconvenient.
Action agent	Process refunds, modify production data, execute trades, deploy code	High. Mistakes are expensive or irreversible.

The architecture differs materially per pattern. Building all three with the same engineering effort under-protects the high-risk ones + over-engineers the low-risk ones.

The architecture layers

A production agent stack has five distinct layers.

Layer 1: the model

The LLM that powers reasoning + tool selection. Anthropic Claude (Opus + Sonnet + Haiku tiers), OpenAI GPT-5 + GPT-4o, Google Gemini 2.5, Meta Llama 3.3 for self-hosted, Mistral for European data-residency needs.

For enterprise India workloads, the dominant choice in 2026 is Claude on AWS Bedrock in ap-south-1 (Mumbai) for data-residency + ZDR posture. Fallback to direct Anthropic API for development.

Layer 2: the tool layer

The tools the agent can call. Each tool has:

A name + schema (input parameters + output shape)
An implementation (the actual code that executes)
A permission scope (which agents can call this tool + under which conditions)
An audit log (every call recorded with caller + timestamp + outcome)

The 2024-2025 standardisation: Model Context Protocol (MCP) lets you write a tool once + expose it to any MCP-compatible agent (Claude, Cursor, custom orchestrators). Before MCP, every agent framework had its own tool format; integration was per-agent. MCP collapsed that to one format.

Layer 3: the orchestration layer

The code that runs the agent loop:

LOOP:
  Send conversation + available tools to LLM
  LLM responds with text OR tool call
  IF tool call:
    Validate inputs
    Execute tool
    Append result to conversation
    Continue loop
  IF text response:
    Return to user OR continue if multi-step task
  IF max iterations reached OR token budget exceeded:
    Bail with partial result

The orchestration layer is where most production agent bugs live. The model loops forever, calls the wrong tool, retries the same failing tool 47 times, or terminates without producing a useful answer.

Layer 4: the safety layer

What separates a demo from production:

Input sanitisation (the user's prompt is treated as data, not instructions)
Output validation (does the agent's response match the schema the downstream system expects)
Tool-call rate limiting (no more than N calls per minute per agent per user)
Action gates (high-risk tools require explicit human approval before execution)
Audit logging (every prompt + every tool call + every response stored)

Layer 5: the observability layer

The agent's behaviour is otherwise opaque:

Trace-per-conversation (every prompt + tool call + result captured)
Cost-per-conversation tracking
Latency p50/p95/p99 per tool call
Error rate per tool + per agent
Eval scores tracked against held-out test sets

Without observability, you cannot debug agents in production. With it, you can.

Tool design: the highest-use decision

The tools you expose to the agent shape what it can do + how reliably it does it.

Principle 1: small, well-named, single-purpose tools

A tool called search_orders that takes customer_email + date_range + status_filter is better than one tool called query_database that takes a SQL string. The narrower the tool, the easier the model picks it correctly + the lower the risk of misuse.

Principle 2: explicit schemas, not freeform

Tools accept structured input (JSON schema). Tools return structured output (JSON or typed objects). Freeform text-in + text-out tools force the model to do parsing + interpretation work it gets wrong.

Principle 3: error messages the agent can act on

When a tool fails, the error message tells the agent WHY + WHAT TO DO:

Bad: "Error: invalid request"
Good: "Error: customer_email is required + must be a valid email. Pass a properly formatted email + retry."

Models are much better at recovering from descriptive errors than from generic ones.

Principle 4: idempotency where possible

If the agent retries a tool call (network blip, timeout), the second call should produce the same result as the first, not duplicate the side-effect. Idempotency keys on writes, retries built into reads.

Principle 5: permission scoping at the tool level

Each tool declares: which agent classes can call this tool, under what conditions, with what approval required. A read-only research agent does not have the send_email tool available; the customer-service agent does not have the process_refund tool without supervisor approval gate.

MCP: when to use the open standard vs custom

Model Context Protocol is Anthropic's open standard for connecting AI assistants to external data sources + tools. Adopted by Cursor, Claude Desktop, Continue, and increasingly by the Anthropic API itself in 2025-2026.

MCP makes sense when...

You're building tools that multiple agents will consume (the same search_crm tool used by the customer-service agent + the sales-research agent)
You're using off-the-shelf agent frameworks (Claude Code, Cursor) that already speak MCP
You want tool implementations to live in the customer's infrastructure + the agent to live elsewhere (the MCP server is the boundary)

Custom orchestration makes sense when...

Your agent loop has business-specific logic that doesn't fit MCP's stateless model
Latency is critical + the MCP roundtrip adds too much overhead
You're building a single-purpose agent where the engineering investment in MCP isn't justified

For enterprise programmes with 5+ tools + 2+ agents, MCP is usually the right architectural choice. For a single agent with 2-3 tools, custom orchestration is simpler.

Sandbox isolation: containing the blast radius

Agents make mistakes. The architecture decision is how big a mistake can become.

Three sandbox tiers

Tier	What it means	When to use
Read-only sandbox	Agent reads from production but writes to a staging mirror	Research, analysis, briefing agents
Write-through with rollback	Agent writes to production but every change is reversible (CRM, ticketing, draft modes)	Internal-tool agents, customer service
Production-direct with gates	Agent acts on production with explicit human approval per action	Refunds, deployments, financial transactions

The mistake teams make: treating every agent as if it needs production-direct access. Most don't. A research agent that can write back its findings as a draft email (not a sent email) does 90 percent of the useful work with 5 percent of the risk.

What goes in the sandbox vs outside

Inside the sandbox (controlled, audited): - Tool calls - Data the agent reads - Side effects (writes, sends, deploys) - Error recovery attempts

Outside the sandbox (the human): - High-risk tool execution approval - Reviewing the agent's plan before it executes - Final delivery (the agent drafts; the human sends)

The eval suite: what makes agents production-grade

Agents drift silently. The eval suite is what catches drift before it becomes an incident.

Held-out task set with known good outcomes

50-200 tasks sampled from real production traffic (anonymised) or written by domain experts. Each has:

The input task description
The expected sequence of tool calls (loose or strict)
The expected output (when applicable)
The grading rubric

Per-task metrics

Metric	What it measures
Task success rate	Did the agent complete the task
Tool selection accuracy	Did the agent pick the right tools
Tool call efficiency	How many calls did it take vs the optimal
Latency p50 / p95	How long did the agent take
Cost per task	Tokens consumed + tool execution cost
Safety violations	Any safety-gate trips, escalations, or unauthorised actions

Regression suite

Every change to the agent (model version upgrade, tool changes, prompt updates, MCP server updates) runs through the eval suite. Score deltas vs baseline are flagged + blocked if material.

Production sample monitoring

Daily sample of real agent runs gets graded (LLM-as-judge for some metrics, human review for safety-relevant samples). Score drift triggers investigation.

Cost discipline

Production agents add up. The cost lines:

Cost line	Typical monthly at mid-market scale
LLM API (Claude Sonnet / Opus / OpenAI equivalent)	Rs 1-15 lakh
Self-hosted model infrastructure (if applicable)	Rs 2-25 lakh
Tool execution infrastructure (database queries, API calls, sandboxes)	Rs 50K-5 lakh
Observability + audit log storage	Rs 25K-2 lakh
MCP server hosting (if applicable)	Rs 25K-1 lakh

For Indian D2C + SaaS programmes deploying agents at meaningful scale, monthly LLM cost typically lands Rs 2-12 lakh in year 1.

The cost optimisation levers

Model tiering: Haiku for cheap routing, Sonnet for the main reasoning, Opus only for the hard cases
Prompt caching: Reuse cached system prompts across requests (Claude + GPT-4o both support this; 50-90 percent cost reduction on the cached portion)
Batch processing: Non-real-time tasks via batch APIs at materially lower price
Response truncation: Cap max output tokens; the model doesn't need to fill the whole context budget
Tool-call early-termination: When the answer is reachable in 3 tool calls but the loop continues, end early

Governance + safety

Three governance components:

Approval gates per agent class

Agent class	Pre-approval needed for
Read-only research	Nothing (already low-risk)
Write-back customer service	Refunds > Rs 5,000, escalations to legal, public communications
Action agent	Every production write

Audit log

Every prompt + every tool call + every response stored for 12+ months. Indexed for query. Reviewed monthly for anomalies (tool calls outside business hours, unusual patterns, repeated failures).

Red team + adversarial testing

Quarterly: a designated team tries to break the agent. Prompt injection, social engineering, escalation attempts. Findings feed into the safety layer + the eval suite.

What does this look like in production?

A real-world snapshot of an Indian SaaS at $5M ARR running a customer-service agent:

Model: Claude Sonnet 4.6 on AWS Bedrock ap-south-1 (Mumbai)
Tools: 14 MCP-exposed tools covering customer-lookup, ticket-creation, knowledge-base-search, refund-request, escalate-to-human
Sandbox tier: Write-through with rollback for ticket updates; production-direct with gate for refunds above Rs 5,000
Eval suite: 120 held-out tasks + regression on every change
Observability: LangSmith for traces; Datadog for latency + errors; custom warehouse for cost tracking
Cost: Rs 8.5 lakh/month at 12,000 conversations/month (~Rs 70/conversation, dropping over time as caching matures)
Performance: 71 percent task success rate, average 4.2 tool calls per task, 18 seconds median full-scope

Production checklist

For a production AI agent programme:

Risk tier per agent documented (read-only / write-back / action)
Model + region selected (Claude Bedrock ap-south-1 is default for Indian enterprise)
MCP server (or custom orchestration) deployed
Tools designed: small, well-named, schema-typed, idempotent, with descriptive errors
Permission scopes per tool per agent
Sandbox tier matched to risk tier
Approval gates configured for high-risk actions
Eval suite (50+ tasks) running on every change
Observability: trace + cost + latency + error tracking
Audit log retention (12+ months)
Monthly governance review + quarterly red-team
Cost monitoring + budget alerts

References + linked context

Dcrayons glossary: model-context-protocol, zero-data-retention, vector-database, eval-suite
Dcrayons reference architectures: Production Claude on AWS Mumbai, Enterprise RAG Architecture

Agents are the most operationally consequential AI deployment most enterprises run in 2026. If your programme is at the build-vs-buy decision point, the production-readiness gap, or the governance question, reach out via the contact form for a 30-minute review.

Tagsai-agentsmcptool-useclaudeenterpriseblog

Enterprise AI Agent Architecture: Tool-Use, MCP, and the Production Discipline