LLM costs in 2024 were a nuisance for most teams. In 2026 they're a P&L line: a moderately-used AI feature at mid-market scale costs Rs 1-15 lakh per month in API calls. Without cost discipline, the same workload can cost 2-3x that.
This is the practical guide to the five highest-impact cost optimisations. Most are 30-minute changes that produce 20-50 percent cost reduction. A few require architectural rework. All of them are worth doing.
How LLM pricing actually works in 2026
Modern LLM API pricing has three components per request:
| Component | What it means | Typical cost |
|---|---|---|
| Input tokens | The prompt + system message + context you send | Rs 12-1,250 per million tokens |
| Output tokens | The response the model generates | Rs 50-6,250 per million tokens |
| Cached input tokens | Input the provider has already seen and cached | 10-50% of normal input cost |
The 4-5x premium on output tokens means: the model is expensive to make TALK, cheap to make LISTEN. Cost optimisation focuses on output token reduction first.
Anthropic Claude pricing (2026)
| Model | Input (per 1M) | Output (per 1M) | Cached input |
|---|---|---|---|
| Claude Opus 4.7 | Rs 1,250 | Rs 6,250 | Rs 125 (90% off) |
| Claude Sonnet 4.6 | Rs 250 | Rs 1,250 | Rs 25 (90% off) |
| Claude Haiku 4.5 | Rs 80 | Rs 400 | Rs 8 (90% off) |
OpenAI pricing (2026)
| Model | Input (per 1M) | Output (per 1M) | Cached input |
|---|---|---|---|
| GPT-5 | Rs 1,000 | Rs 4,000 | Rs 500 (50% off) |
| GPT-4o | Rs 200 | Rs 1,000 | Rs 100 (50% off) |
| GPT-4o-mini | Rs 12 | Rs 50 | Rs 6 (50% off) |
The pricing model + the discount structure differ across providers. The optimisation playbook is roughly the same.
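To make the tables concrete, here is a minimal sketch that prices a single request from its token counts. The PRICES dictionary copies the Sonnet 4.6 and GPT-4o rows from the tables above; the function name request_cost_rs and the dictionary keys are illustrative and not part of any SDK.

```python
# Prices in Rs per 1M tokens, copied from the tables above.
# The model keys are labels for this sketch, not official API model IDs.
PRICES = {
    "claude-sonnet-4.6": {"input": 250, "output": 1250, "cached": 25},
    "gpt-4o": {"input": 200, "output": 1000, "cached": 100},
}


def request_cost_rs(model, fresh_input, cached_input, output):
    """Return the cost in rupees of one request, given its token counts."""
    p = PRICES[model]
    total = (
        fresh_input * p["input"]
        + cached_input * p["cached"]
        + output * p["output"]
    )
    return total / 1_000_000


# A 4,200-token prompt with a 500-token answer and no caching.
cost = request_cost_rs("claude-sonnet-4.6", fresh_input=4200, cached_input=0, output=500)
print(f"Rs {cost:.3f}")
```

Multiplying the result by monthly request volume gives a first estimate of the bill before any optimisation is applied.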
Optimisation 1: prompt caching (the biggest win)
Most LLM applications send a large, relatively static system prompt plus a small, dynamic user message. The system prompt is sent again on every request. Caching it cuts cost dramatically.
How prompt caching works
The provider hashes the prompt prefix. If the same prefix appears in a subsequent request within the cache TTL (typically 5 minutes for Anthropic, longer for OpenAI), the cached tokens cost 10-50 percent of normal input cost.
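The mechanism can be illustrated with a toy model. This is not how any provider implements caching; the class ToyPrefixCache, its method names, and the choice to refresh the TTL on every lookup are assumptions made for illustration only.

```python
import hashlib


class ToyPrefixCache:
    """Toy model of prefix caching; not any provider's real implementation."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.expiry = {}  # prefix hash -> time at which the entry expires

    def lookup(self, prefix, now):
        """Return True on a cache hit; store or refresh the entry either way."""
        key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        hit = self.expiry.get(key, 0) > now
        self.expiry[key] = now + self.ttl  # assumption: TTL restarts on each use
        return hit


cache = ToyPrefixCache(ttl_seconds=300)
first = cache.lookup("SYSTEM PROMPT", now=0)          # miss: nothing stored yet
second = cache.lookup("SYSTEM PROMPT", now=120)       # hit: within the TTL
third = cache.lookup("SYSTEM PROMPT", now=1000)       # miss: the entry has expired
changed = cache.lookup("SYSTEM PROMPT v2", now=1001)  # miss: different prefix
```

The last lookup shows why static content should come first in the prompt: any change inside the prefix produces a different hash and a cache miss.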
What to cache
- System prompt + role definition (rarely changes)
- Tool definitions (change only when you add / remove tools)
- Long context documents (when many requests share the same document context, like a customer profile or a knowledge base chunk)
- Few-shot examples (cache them once + reuse across requests)
What NOT to cache
- The current user's input (changes every request)
- Real-time data (changes by the second)
- Tiny prompts (providers only cache prefixes above a minimum length, and for short prompts the caching overhead exceeds the saving)
Implementation
For Anthropic Claude (Python SDK):
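import anthropic

# LARGE_SYSTEM_PROMPT and user_query are assumed to be defined elsewhere:
# the long, static instructions and the per-request user message.

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,
            # Mark the prompt up to and including this block as cacheable.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {"role": "user", "content": user_query}
    ],
)

# The usage object reports tokens written to and read from the cache.
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)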
The cache_control block marks the system prompt as cacheable. Subsequent requests that send the same system prompt within the cache TTL read it from the cache at the discounted rate.
Expected savings
For a typical agent application with a 4,000-token system prompt + 200-token user messages:
- Without caching: 4,200 input tokens per request
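- With caching (once the first request has populated the cache): 200 fresh + 4,000 cached input tokens per request
- Cost reduction: ~85 percent on input tokens (with cached tokens billed at 10 percent of the normal rate)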
For high-volume applications, this single optimisation cuts total cost by 30-50 percent.
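The input-token saving in the example above can be checked with a few lines of arithmetic. This sketch assumes cached tokens are billed at 10 percent of the normal input price (the Anthropic discount in the pricing table) and ignores any surcharge for the first request that writes the cache.

```python
SYSTEM_TOKENS = 4_000      # static prefix, from the example above
USER_TOKENS = 200          # dynamic part, from the example above
CACHED_PRICE_RATIO = 0.10  # assumption: cached tokens cost 10% of the input price

# Express both cases in "full-price input token" equivalents.
uncached = SYSTEM_TOKENS + USER_TOKENS
cached = USER_TOKENS + SYSTEM_TOKENS * CACHED_PRICE_RATIO
saving = 1 - cached / uncached

print(f"{uncached} -> {cached:.0f} input-token equivalents, saving {saving:.1%}")
```

Setting CACHED_PRICE_RATIO to 0.50 (the OpenAI discount in the pricing table) gives a saving of roughly 48 percent on the same workload.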
Optimisation 2: batch APIs (50% off for non-real-time)
Anthropic + OpenAI both offer batch APIs: submit a batch of requests, get results back within 24 hours, pay 50 percent less than the synchronous API.
When batch APIs are the right choice
- Offline classification (process yesterday's customer service tickets for sentiment + categorisation)
- Content generation pipelines (generate product descriptions for a new catalog batch)
- Eval suite runs (test 1,000 prompts against a new model version overnight)
- Data enrichment (analyse 50,000 documents over the weekend)
When batch APIs are wrong
- Real-time user-facing requests (a 24-hour wait is unacceptable)
- Interactive workflows where the user expects a response in seconds
Implementation note
Most teams use the batch API for the wrong things (because it's "cheap") + miss the right things (because it requires architectural rework). The architectural shift: split your workload into "synchronous, real-time" vs "asynchronous, can wait". Then route appropriately.
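As a sketch of the asynchronous path, the snippet below builds a batch of ticket-classification requests and submits them through the Anthropic Message Batches endpoint. The helper names, the prompt wording, and the model ID "claude-haiku-4-5" are assumptions for illustration; the submit function needs the anthropic package and an API key, so it is defined but not called here.

```python
def build_batch_requests(tickets, model="claude-haiku-4-5", max_tokens=100):
    """Turn {ticket_id: text} into the request list the batch endpoint expects."""
    return [
        {
            "custom_id": ticket_id,  # used to match results back to tickets
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [
                    {
                        "role": "user",
                        "content": f"Classify the sentiment of this ticket:\n{text}",
                    }
                ],
            },
        }
        for ticket_id, text in tickets.items()
    ]


def submit_batch(batch_requests):
    """Submit the batch; requires the anthropic package and an API key."""
    import anthropic  # imported here so the builder works without the SDK

    client = anthropic.Anthropic()
    return client.messages.batches.create(requests=batch_requests)


batch_requests = build_batch_requests(
    {"ticket-1": "My order is late.", "ticket-2": "Great service!"}
)
```

Results arrive asynchronously, so the calling code has to store the batch ID and collect the output later, which is the architectural rework mentioned above.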
Optimisation 3: model tiering (cheap for routing, expensive for reasoning)
The model that handles "is this customer's question about billing or product?" does not need to be the same model that handles "generate a personalised 500-word product recommendation". Routing the easy work to a cheaper model saves 70-95 percent.
The tiered architecture
USER REQUEST
     |
     v
[Haiku 4.5: classify + route] routes to one of:
     |
     +--> [Haiku 4.5: simple Q&A]                  (90% of requests)
     +--> [Sonnet 4.6: complex reasoning + tools]  (9% of requests)
     +--> [Opus 4.7: high-stakes generation]       (1% of requests)
Most production agents send at least 90 percent of requests to Haiku/Sonnet + at most 10 percent to Opus or equivalent.
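The routing step can be sketched as a classifier followed by a lookup. In this sketch the classifier is a keyword heuristic standing in for the cheap-model classification call, and the model IDs follow the naming used in this article; both are assumptions to adapt to your own setup.

```python
# Model IDs follow this article's naming; check your provider's model list.
TIER_MODELS = {
    "simple": "claude-haiku-4-5",
    "complex": "claude-sonnet-4-6",
    "high_stakes": "claude-opus-4-7",
}


def classify(request_text):
    """Stand-in for the cheap-model classification call: a keyword heuristic."""
    text = request_text.lower()
    if "contract" in text or "legal" in text:
        return "high_stakes"
    if "compare" in text or "why" in text:
        return "complex"
    return "simple"


def route(request_text):
    """Return the model that should handle this request."""
    return TIER_MODELS[classify(request_text)]


print(route("Where is my invoice?"))
print(route("Compare the Pro and Team plans"))
print(route("Review this contract clause"))
```

In production the classify function would itself be a short, capped call to the cheapest model, so the routing overhead stays small relative to the saving.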
Picking the right tier per task
| Task type | Right model |
|---|---|
| Classification, routing, simple Q&A | Haiku 4.5 / GPT-4o-mini |
| Tool-use, multi-step reasoning, draft generation | Sonnet 4.6 / GPT-4o |
| High-stakes generation, complex analysis, code generation | Opus 4.7 / GPT-5 |
Common failure mode
Defaulting to the most powerful model + paying the premium on tasks that don't need it. The cost difference between Opus + Haiku is 15x; using Opus when Haiku would do is 15x overspend on those requests.
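A quick calculation shows what tiering is worth compared with sending everything to the top model. The sketch uses the input prices from the Claude table and the 90/9/1 split from the diagram, and it assumes every tier sees the same number of tokens per request, which is a simplification.

```python
INPUT_PRICE = {"haiku": 80, "sonnet": 250, "opus": 1250}  # Rs per 1M, from the table
MIX = {"haiku": 0.90, "sonnet": 0.09, "opus": 0.01}       # shares from the diagram

# Weighted average price per 1M input tokens across the three tiers.
blended = sum(MIX[m] * INPUT_PRICE[m] for m in MIX)
saving_vs_opus = 1 - blended / INPUT_PRICE["opus"]

print(f"Blended input price: Rs {blended:.0f} per 1M, {saving_vs_opus:.0%} below all-Opus")
```

Because output prices in the table are five times the input price for every Claude tier, the same ratio holds for output tokens under the same assumption.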
Optimisation 4: response truncation + structured output
If your application only needs the model to return a 50-word summary, capping max_tokens at 200 means a misbehaving model can't blow your budget on a 4,000-word essay.
Structured output for further control
Use the provider's structured-output mode (JSON-schema-typed responses on OpenAI; tool use or structured outputs on Anthropic):
- Forces the model to return parseable, fixed-shape responses
- Prevents the model from padding with explanations
- Reduces output tokens 30-60 percent vs free-form prose
For data extraction + classification + routing workflows, structured output is the right default.
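One way to get fixed-shape output from Claude is to define a single tool and force the model to call it. The sketch below only builds the request arguments; the tool name record_classification, the schema fields, and the model ID are hypothetical, and the actual API call is shown as a comment because it needs the SDK and an API key.

```python
def build_classification_request(ticket_text):
    """Build kwargs for client.messages.create that force a fixed-shape reply."""
    tool = {
        "name": "record_classification",  # hypothetical tool name
        "description": "Record the category and sentiment of a support ticket.",
        "input_schema": {
            "type": "object",
            "properties": {
                "category": {"type": "string", "enum": ["billing", "product", "other"]},
                "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
            },
            "required": ["category", "sentiment"],
        },
    }
    return {
        "model": "claude-haiku-4-5",
        "max_tokens": 200,  # hard cap on output tokens
        "tools": [tool],
        "tool_choice": {"type": "tool", "name": tool["name"]},  # force the tool call
        "messages": [{"role": "user", "content": ticket_text}],
    }


request_kwargs = build_classification_request("I was charged twice this month.")
# With the SDK installed:
#   response = anthropic.Anthropic().messages.create(**request_kwargs)
# The structured result is then the `input` field of the tool_use content block.
```

The enum constraints keep the answer to a couple of short tokens per field, which is where the output-token saving comes from.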
Optimisation 5: prompt + context engineering
Most production prompts are bloated. Trimming saves cost on every request forever.
What to trim
- Verbose role descriptions. "You are an expert customer service representative who is friendly, helpful, and knowledgeable about our company's products and policies, and you always strive to provide the best possible experience..." can become "You are a customer service agent for [Company]. Answer briefly + accurately."
- Redundant instructions. Saying the same thing three different ways doesn't help the model + costs tokens.
- Few-shot examples that don't change behaviour. Test whether removing each example degrades quality; remove the ones that don't.
- Unused tool descriptions. If the agent uses 3 of 12 declared tools, ship only the 3 (with conditional inclusion of the others based on intent classification).
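Conditional tool inclusion can be as simple as a lookup from classified intent to tool names. The tool names, descriptions, and intents below are made up for illustration; real definitions would also carry an input schema.

```python
# Hypothetical tool definitions, abbreviated for the sketch.
ALL_TOOLS = {
    "lookup_order": {"name": "lookup_order", "description": "Fetch an order by ID."},
    "issue_refund": {"name": "issue_refund", "description": "Refund an order."},
    "search_docs": {"name": "search_docs", "description": "Search product docs."},
}

# Which tools each intent actually needs.
TOOLS_BY_INTENT = {
    "billing": ["lookup_order", "issue_refund"],
    "product": ["search_docs"],
}


def tools_for(intent):
    """Return only the tool definitions relevant to the classified intent."""
    return [ALL_TOOLS[name] for name in TOOLS_BY_INTENT.get(intent, [])]


billing_tools = tools_for("billing")
unknown_tools = tools_for("smalltalk")  # unknown intents get no tools
```

Tool definitions are part of the prompt prefix, so varying them per request can lower the cache hit rate; it is worth measuring both effects before committing to this approach.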
Context window discipline
For agents with conversation history, decide:
- How many turns of history to keep (typically 5-10 most recent)
- When to summarise older history into a compact summary
- When to start fresh (reset context after task completion)
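Conversation history grows linearly with each turn, and because the full history is resent on every request, the cumulative cost of a conversation grows faster than linearly. Discipline keeps it bounded.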
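A minimal sketch of the "keep the N most recent turns" policy follows. It assumes history is a list of alternating user and assistant messages in the usual role/content format; the function name trim_history and the default of five turns are illustrative choices, and summarisation of older turns is left out.

```python
def trim_history(messages, max_turns=5):
    """Keep at most the last `max_turns` user/assistant exchanges."""
    if max_turns <= 0:
        return []
    recent = messages[-2 * max_turns:]
    # Drop a leading assistant message so the history still starts with a user turn.
    if recent and recent[0]["role"] == "assistant":
        recent = recent[1:]
    return recent


# Build a 12-exchange conversation (24 messages) to trim.
history = []
for i in range(12):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_turns=5)
print(len(history), "->", len(trimmed), "messages")
```

With a fixed window, per-request input stays bounded no matter how long the conversation runs.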
Monitoring + cost attribution
Without monitoring you cannot improve.
The four metrics to track
| Metric | How to measure |
|---|---|
| Cost per request | Sum of (input cost + output cost) per API call |
| Cost per user / per team / per feature | Tag every request with the upstream identifier + aggregate |
| Cache hit rate | Cached tokens / total input tokens |
| Wasted-token rate | Tokens consumed by failed or retried requests / total tokens |
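The metrics in the table can be computed from a per-request log. The record fields below (feature, fresh_input, cached_input, output, cost_rs, failed) are a hypothetical logging schema, and the three sample records are made-up numbers used only to exercise the function.

```python
from collections import defaultdict


def summarise(records):
    """Compute cache hit rate, wasted-token rate, and cost per feature."""
    input_total = sum(r["fresh_input"] + r["cached_input"] for r in records)
    cached_total = sum(r["cached_input"] for r in records)
    tokens_total = sum(r["fresh_input"] + r["cached_input"] + r["output"] for r in records)
    wasted = sum(
        r["fresh_input"] + r["cached_input"] + r["output"]
        for r in records
        if r["failed"]
    )
    cost_by_feature = defaultdict(float)
    for r in records:
        cost_by_feature[r["feature"]] += r["cost_rs"]
    return {
        "cache_hit_rate": cached_total / input_total if input_total else 0.0,
        "wasted_token_rate": wasted / tokens_total if tokens_total else 0.0,
        "cost_by_feature": dict(cost_by_feature),
    }


# Made-up sample log entries.
records = [
    {"feature": "support", "fresh_input": 200, "cached_input": 4000, "output": 300, "cost_rs": 0.5, "failed": False},
    {"feature": "support", "fresh_input": 4200, "cached_input": 0, "output": 300, "cost_rs": 1.5, "failed": True},
    {"feature": "search", "fresh_input": 500, "cached_input": 0, "output": 500, "cost_rs": 0.75, "failed": False},
]
stats = summarise(records)
print(stats)
```

Cost per request is simply the cost_rs field of each record; the per-user and per-team views follow by grouping on a different tag.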
Alerting thresholds
- Daily cost above N percent of monthly budget pace
- Cache hit rate below 30 percent (suggests caching isn't working)
- Wasted-token rate above 10 percent (suggests retries or errors are draining budget)
- Single user / single feature consuming above N percent of cost (suggests abuse or design issue)
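These thresholds translate into a small check that can run on a schedule. The function below is one possible interpretation: it compares month-to-date spend against a straight-line budget pace, and the pace_tolerance of 1.2 stands in for the "N percent" left open above. The per-user and per-feature concentration check is omitted for brevity.

```python
def check_alerts(stats, spend_to_date, monthly_budget, day_of_month,
                 days_in_month=30, pace_tolerance=1.2):
    """Return a list of alert strings; thresholds mirror the list above."""
    alerts = []
    # Straight-line pace: the share of budget expected to be spent by today.
    expected = monthly_budget * day_of_month / days_in_month
    if spend_to_date > expected * pace_tolerance:
        alerts.append("spend ahead of budget pace")
    if stats["cache_hit_rate"] < 0.30:
        alerts.append("cache hit rate below 30%")
    if stats["wasted_token_rate"] > 0.10:
        alerts.append("wasted-token rate above 10%")
    return alerts


# Illustrative numbers: Rs 6 lakh spent by day 10 against a Rs 10 lakh budget.
alerts = check_alerts(
    {"cache_hit_rate": 0.20, "wasted_token_rate": 0.05},
    spend_to_date=600_000,
    monthly_budget=1_000_000,
    day_of_month=10,
)
print(alerts)
```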
For Indian D2C + SaaS programmes running production AI features, monthly cost reviews + quarterly architecture audits are the cadence.
What this looks like in practice
A real-world before / after for an Indian SaaS at $3M ARR running a customer-service agent:
Before optimisation:
- Model: Claude Opus 4.7 for everything
- No caching
- No batching
- 12,000 conversations/month
- Cost: Rs 14 lakh/month
After optimisation (6 weeks of work):
- Model tiering: 70% Haiku, 25% Sonnet, 5% Opus
- Prompt caching on system prompt + tool definitions (4,000 tokens cached)
- Batch API for overnight ticket classification
- Response truncation + structured output where applicable
- 12,000 conversations/month
- Cost: Rs 5.2 lakh/month
Net: 63 percent cost reduction at unchanged user-facing performance.
Common mistakes
- Defaulting to the most expensive model. Most production workloads don't need it.
- No prompt caching. The single highest-impact optimisation in 2026.
- Long verbose prompts. Tokens add up; trim ruthlessly.
- No max_tokens cap. A misbehaving model can blow your budget on a single request.
- No cost monitoring. You cannot optimise what you don't measure.
Production checklist
For a production LLM application running cost-effectively:
- Prompt caching enabled on system prompts + tool definitions
- Model tiering: cheap model for routing + simple, mid-tier for main reasoning, top-tier only for hard cases
- Batch API used for asynchronous workloads
- Structured output for data-extraction + classification tasks
- max_tokens cap on every request
- Conversation history trimming policy
- Cost-per-request + cost-per-feature monitoring
- Daily / weekly / monthly cost dashboards
- Alerting on cost anomalies
- Quarterly prompt audit + trim pass
- Cache hit rate above 50 percent (target)
- Wasted-token rate below 5 percent (target)
LLM cost is the easiest line item to ignore + the most consequential to optimise.



