The conversational chatbot was 2023. The retrieval-augmented assistant was 2024. The production agent that takes actions, holds state across tasks, and integrates with the rest of your stack is 2025-2026. The architecture under those agents looks different from a RAG pipeline + an LLM. Without that difference, agents fail interestingly in dev + then catastrophically in prod.
This is the reference architecture we apply on enterprise AI agent engagements. It covers tool design, Model Context Protocol (MCP) integration, sandbox isolation, the eval suite that makes agents production-grade, and the cost + safety discipline that keeps the programme defensible.
What "agent" actually means in 2026
The word has been diluted. For this piece:
An AI agent is an LLM-driven system that takes a task, decides on a sequence of tool calls, executes them, observes results, and continues until the task is complete or it determines it cannot complete it.
The difference from a chatbot: chatbots answer questions; agents act on the world. The difference from a RAG pipeline: RAG retrieves + answers; agents retrieve + decide + act.
Three agent patterns in production
| Pattern | Use case | Risk surface |
|---|---|---|
| Read-only research agent | Summarise documents, answer questions from internal data, prepare briefings | Low. Cannot mutate state. |
| Write-back agent | Update CRM records, send templated emails, post to ticketing systems | Medium. Mutations are reversible but inconvenient. |
| Action agent | Process refunds, modify production data, execute trades, deploy code | High. Mistakes are expensive or irreversible. |
The architecture differs materially per pattern. Building all three with the same engineering effort under-protects the high-risk ones + over-engineers the low-risk ones.
The architecture layers
A production agent stack has five distinct layers.
Layer 1: the model
The LLM that powers reasoning + tool selection. Anthropic Claude (Opus + Sonnet + Haiku tiers), OpenAI GPT-5 + GPT-4o, Google Gemini 2.5, Meta Llama 3.3 for self-hosted, Mistral for European data-residency needs.
For enterprise India workloads, the dominant choice in 2026 is Claude on AWS Bedrock in ap-south-1 (Mumbai) for data-residency + ZDR posture. Fallback to direct Anthropic API for development.
Layer 2: the tool layer
The tools the agent can call. Each tool has:
- A name + schema (input parameters + output shape)
- An implementation (the actual code that executes)
- A permission scope (which agents can call this tool + under which conditions)
- An audit log (every call recorded with caller + timestamp + outcome)
The 2024-2025 standardisation: Model Context Protocol (MCP) lets you write a tool once + expose it to any MCP-compatible agent (Claude, Cursor, custom orchestrators). Before MCP, every agent framework had its own tool format; integration was per-agent. MCP collapsed that to one format.
Layer 3: the orchestration layer
The code that runs the agent loop:
LOOP:
Send conversation + available tools to LLM
LLM responds with text OR tool call
IF tool call:
Validate inputs
Execute tool
Append result to conversation
Continue loop
IF text response:
Return to user OR continue if multi-step task
IF max iterations reached OR token budget exceeded:
Bail with partial result
The orchestration layer is where most production agent bugs live. The model loops forever, calls the wrong tool, retries the same failing tool 47 times, or terminates without producing a useful answer.
Layer 4: the safety layer
What separates a demo from production:
- Input sanitisation (the user's prompt is treated as data, not instructions)
- Output validation (does the agent's response match the schema the downstream system expects)
- Tool-call rate limiting (no more than N calls per minute per agent per user)
- Action gates (high-risk tools require explicit human approval before execution)
- Audit logging (every prompt + every tool call + every response stored)
Layer 5: the observability layer
The agent's behaviour is otherwise opaque:
- Trace-per-conversation (every prompt + tool call + result captured)
- Cost-per-conversation tracking
- Latency p50/p95/p99 per tool call
- Error rate per tool + per agent
- Eval scores tracked against held-out test sets
Without observability, you cannot debug agents in production. With it, you can.
Tool design: the highest-use decision
The tools you expose to the agent shape what it can do + how reliably it does it.
Principle 1: small, well-named, single-purpose tools
A tool called search_orders that takes customer_email + date_range + status_filter is better than one tool called query_database that takes a SQL string. The narrower the tool, the easier the model picks it correctly + the lower the risk of misuse.
Principle 2: explicit schemas, not freeform
Tools accept structured input (JSON schema). Tools return structured output (JSON or typed objects). Freeform text-in + text-out tools force the model to do parsing + interpretation work it gets wrong.
Principle 3: error messages the agent can act on
When a tool fails, the error message tells the agent WHY + WHAT TO DO:
- Bad:
"Error: invalid request" - Good:
"Error: customer_email is required + must be a valid email. Pass a properly formatted email + retry."
Models are much better at recovering from descriptive errors than from generic ones.
Principle 4: idempotency where possible
If the agent retries a tool call (network blip, timeout), the second call should produce the same result as the first, not duplicate the side-effect. Idempotency keys on writes, retries built into reads.
Principle 5: permission scoping at the tool level
Each tool declares: which agent classes can call this tool, under what conditions, with what approval required. A read-only research agent does not have the send_email tool available; the customer-service agent does not have the process_refund tool without supervisor approval gate.
MCP: when to use the open standard vs custom
Model Context Protocol is Anthropic's open standard for connecting AI assistants to external data sources + tools. Adopted by Cursor, Claude Desktop, Continue, and increasingly by the Anthropic API itself in 2025-2026.
MCP makes sense when...
- You're building tools that multiple agents will consume (the same
search_crmtool used by the customer-service agent + the sales-research agent) - You're using off-the-shelf agent frameworks (Claude Code, Cursor) that already speak MCP
- You want tool implementations to live in the customer's infrastructure + the agent to live elsewhere (the MCP server is the boundary)
Custom orchestration makes sense when...
- Your agent loop has business-specific logic that doesn't fit MCP's stateless model
- Latency is critical + the MCP roundtrip adds too much overhead
- You're building a single-purpose agent where the engineering investment in MCP isn't justified
For enterprise programmes with 5+ tools + 2+ agents, MCP is usually the right architectural choice. For a single agent with 2-3 tools, custom orchestration is simpler.
Sandbox isolation: containing the blast radius
Agents make mistakes. The architecture decision is how big a mistake can become.
Three sandbox tiers
| Tier | What it means | When to use |
|---|---|---|
| Read-only sandbox | Agent reads from production but writes to a staging mirror | Research, analysis, briefing agents |
| Write-through with rollback | Agent writes to production but every change is reversible (CRM, ticketing, draft modes) | Internal-tool agents, customer service |
| Production-direct with gates | Agent acts on production with explicit human approval per action | Refunds, deployments, financial transactions |
The mistake teams make: treating every agent as if it needs production-direct access. Most don't. A research agent that can write back its findings as a draft email (not a sent email) does 90 percent of the useful work with 5 percent of the risk.
What goes in the sandbox vs outside
Inside the sandbox (controlled, audited): - Tool calls - Data the agent reads - Side effects (writes, sends, deploys) - Error recovery attempts
Outside the sandbox (the human): - High-risk tool execution approval - Reviewing the agent's plan before it executes - Final delivery (the agent drafts; the human sends)
The eval suite: what makes agents production-grade
Agents drift silently. The eval suite is what catches drift before it becomes an incident.
Held-out task set with known good outcomes
50-200 tasks sampled from real production traffic (anonymised) or written by domain experts. Each has:
- The input task description
- The expected sequence of tool calls (loose or strict)
- The expected output (when applicable)
- The grading rubric
Per-task metrics
| Metric | What it measures |
|---|---|
| Task success rate | Did the agent complete the task |
| Tool selection accuracy | Did the agent pick the right tools |
| Tool call efficiency | How many calls did it take vs the optimal |
| Latency p50 / p95 | How long did the agent take |
| Cost per task | Tokens consumed + tool execution cost |
| Safety violations | Any safety-gate trips, escalations, or unauthorised actions |
Regression suite
Every change to the agent (model version upgrade, tool changes, prompt updates, MCP server updates) runs through the eval suite. Score deltas vs baseline are flagged + blocked if material.
Production sample monitoring
Daily sample of real agent runs gets graded (LLM-as-judge for some metrics, human review for safety-relevant samples). Score drift triggers investigation.
Cost discipline
Production agents add up. The cost lines:
| Cost line | Typical monthly at mid-market scale |
|---|---|
| LLM API (Claude Sonnet / Opus / OpenAI equivalent) | Rs 1-15 lakh |
| Self-hosted model infrastructure (if applicable) | Rs 2-25 lakh |
| Tool execution infrastructure (database queries, API calls, sandboxes) | Rs 50K-5 lakh |
| Observability + audit log storage | Rs 25K-2 lakh |
| MCP server hosting (if applicable) | Rs 25K-1 lakh |
For Indian D2C + SaaS programmes deploying agents at meaningful scale, monthly LLM cost typically lands Rs 2-12 lakh in year 1.
The cost optimisation levers
- Model tiering: Haiku for cheap routing, Sonnet for the main reasoning, Opus only for the hard cases
- Prompt caching: Reuse cached system prompts across requests (Claude + GPT-4o both support this; 50-90 percent cost reduction on the cached portion)
- Batch processing: Non-real-time tasks via batch APIs at materially lower price
- Response truncation: Cap max output tokens; the model doesn't need to fill the whole context budget
- Tool-call early-termination: When the answer is reachable in 3 tool calls but the loop continues, end early
Governance + safety
Three governance components:
Approval gates per agent class
| Agent class | Pre-approval needed for |
|---|---|
| Read-only research | Nothing (already low-risk) |
| Write-back customer service | Refunds > Rs 5,000, escalations to legal, public communications |
| Action agent | Every production write |
Audit log
Every prompt + every tool call + every response stored for 12+ months. Indexed for query. Reviewed monthly for anomalies (tool calls outside business hours, unusual patterns, repeated failures).
Red team + adversarial testing
Quarterly: a designated team tries to break the agent. Prompt injection, social engineering, escalation attempts. Findings feed into the safety layer + the eval suite.
What does this look like in production?
A real-world snapshot of an Indian SaaS at $5M ARR running a customer-service agent:
- Model: Claude Sonnet 4.6 on AWS Bedrock ap-south-1 (Mumbai)
- Tools: 14 MCP-exposed tools covering customer-lookup, ticket-creation, knowledge-base-search, refund-request, escalate-to-human
- Sandbox tier: Write-through with rollback for ticket updates; production-direct with gate for refunds above Rs 5,000
- Eval suite: 120 held-out tasks + regression on every change
- Observability: LangSmith for traces; Datadog for latency + errors; custom warehouse for cost tracking
- Cost: Rs 8.5 lakh/month at 12,000 conversations/month (~Rs 70/conversation, dropping over time as caching matures)
- Performance: 71 percent task success rate, average 4.2 tool calls per task, 18 seconds median full-scope
Production checklist
For a production AI agent programme:
- Risk tier per agent documented (read-only / write-back / action)
- Model + region selected (Claude Bedrock ap-south-1 is default for Indian enterprise)
- MCP server (or custom orchestration) deployed
- Tools designed: small, well-named, schema-typed, idempotent, with descriptive errors
- Permission scopes per tool per agent
- Sandbox tier matched to risk tier
- Approval gates configured for high-risk actions
- Eval suite (50+ tasks) running on every change
- Observability: trace + cost + latency + error tracking
- Audit log retention (12+ months)
- Monthly governance review + quarterly red-team
- Cost monitoring + budget alerts
References + linked context
- Dcrayons glossary: model-context-protocol, zero-data-retention, vector-database, eval-suite
- Dcrayons reference architectures: Production Claude on AWS Mumbai, Enterprise RAG Architecture
Agents are the most operationally consequential AI deployment most enterprises run in 2026. If your programme is at the build-vs-buy decision point, the production-readiness gap, or the governance question, reach out via the contact form for a 30-minute review.



