How to Cut Your LLM Costs by 50% in 2026: Caching, Batching, and Model Tiering

LLM costs in 2024 were a nuisance for most teams. In 2026 they're a P&L line: a moderately-used AI feature at mid-market scale costs Rs 1-15 lakh per month in API calls. Without cost discipline, the same workload can cost 2-3x that.

This is the practical guide to the five highest-impact cost optimisations. Most are 30-minute changes that produce 20-50 percent cost reduction. A few require architectural rework. All of them are worth doing.

How LLM pricing actually works in 2026

Modern LLM API pricing has three components per request:

Component	What it means	Typical cost
Input tokens	The prompt + system message + context you send	Rs 12-1,250 per million tokens
Output tokens	The response the model generates	Rs 50-6,250 per million tokens
Cached input tokens	Input the provider has already seen and cached	10-50% of normal input cost

The 4-5x premium on output tokens means the model is expensive to make TALK and cheap to make LISTEN. Cost optimisation focuses on output token reduction first.
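As a minimal sketch of how the three components combine into a per-request cost, the function below multiplies each token count by its price. The `Pricing` class and `request_cost` function are illustrative names, and the prices are the Sonnet 4.6 figures from the Claude table in this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices in rupees per one million tokens."""
    input_per_m: float
    output_per_m: float
    cached_input_per_m: float


def request_cost(pricing: Pricing, fresh_input: int, cached_input: int, output: int) -> float:
    """Return the rupee cost of one request from its three token counts."""
    return (
        fresh_input * pricing.input_per_m
        + cached_input * pricing.cached_input_per_m
        + output * pricing.output_per_m
    ) / 1_000_000


# Sonnet 4.6 prices as listed in this guide's Claude table (illustrative).
example = Pricing(input_per_m=250.0, output_per_m=1250.0, cached_input_per_m=25.0)

# Same token count on each side: the output side costs five times as much.
input_only = request_cost(example, fresh_input=1_000, cached_input=0, output=0)
output_only = request_cost(example, fresh_input=0, cached_input=0, output=1_000)
print(f"input: Rs {input_only:.2f}, output: Rs {output_only:.2f}")
```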

Anthropic Claude pricing (2026)

Model	Input (per 1M)	Output (per 1M)	Cached input
Claude Opus 4.7	Rs 1,250	Rs 6,250	Rs 125 (90% off)
Claude Sonnet 4.6	Rs 250	Rs 1,250	Rs 25 (90% off)
Claude Haiku 4.5	Rs 80	Rs 400	Rs 8 (90% off)

OpenAI pricing (2026)

Model	Input (per 1M)	Output (per 1M)	Cached input
GPT-5	Rs 1,000	Rs 4,000	Rs 500 (50% off)
GPT-4o	Rs 200	Rs 1,000	Rs 100 (50% off)
GPT-4o-mini	Rs 12	Rs 50	Rs 6 (50% off)

The pricing models and discount structures differ across providers, but the optimisation playbook is roughly the same.

Optimisation 1: prompt caching (the biggest win)

Most LLM applications send a large, relatively static system prompt followed by a small, dynamic user message. The system prompt is resent, and billed, on every request. Caching it cuts cost dramatically.

How prompt caching works

The provider hashes the prompt prefix. If the same prefix appears in a subsequent request within the cache TTL (typically 5 minutes for Anthropic, longer for OpenAI), the cached tokens cost 10-50 percent of normal input cost.

What to cache

System prompt + role definition (rarely changes)
Tool definitions (change only when you add / remove tools)
Long context documents (when many requests share the same document context, like a customer profile or a knowledge base chunk)
Few-shot examples (cache them once + reuse across requests)

What NOT to cache

The current user's input (changes every request)
Real-time data (changes by the second)
Tiny prompts (providers only cache prefixes above a minimum length, on the order of 1,000 tokens, so short prompts gain nothing)

Implementation

For Anthropic Claude (Python SDK):

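import anthropic

# Placeholders for illustration; substitute your real prompt and query.
# The real system prompt must be long enough to meet the provider's caching minimum.
LARGE_SYSTEM_PROMPT = "..."
user_query = "Where is my order?"

# Reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,
            # Marks the prompt up to and including this block as a cacheable prefix.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    # The per-request user message stays outside the cached prefix.
    messages=[
        {"role": "user", "content": user_query}
    ],
)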

The cache_control block marks the system prompt as cacheable. Subsequent requests that send the identical system prompt within the cache TTL read it from the cache at the discounted rate.
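To check that caching is actually taking effect, the usage counters returned with each response can be inspected. The sketch below assumes the Anthropic-style fields `input_tokens`, `cache_creation_input_tokens` and `cache_read_input_tokens` on `response.usage`. The `cache_summary` helper is a hypothetical name, and a `SimpleNamespace` stands in for a real response so the example runs without an API call.

```python
from types import SimpleNamespace


def cache_summary(usage) -> dict:
    """Summarise cache activity from a usage object with Anthropic-style fields."""
    # getattr with a default guards against fields being absent or None.
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    fresh = getattr(usage, "input_tokens", 0) or 0
    total = fresh + written + read
    return {
        "fresh": fresh,
        "cache_written": written,
        "cache_read": read,
        "hit_rate": read / total if total else 0.0,
    }


# Stand-in for response.usage on a second request that hit the cache.
usage = SimpleNamespace(
    input_tokens=200,
    cache_creation_input_tokens=0,
    cache_read_input_tokens=4000,
)
summary = cache_summary(usage)
print(summary)
```

A first request would normally show the tokens under `cache_written` instead; if `cache_read` stays at zero on later requests, the prefix is probably changing between calls or falling below the minimum cacheable length.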

Expected savings

For a typical agent application with a 4,000-token system prompt + 200-token user messages:

Without caching: 4,200 input tokens per request
With caching (after first hit): 200 fresh + 4,000 cached
Cost reduction: ~85 percent on input tokens
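The arithmetic behind that figure can be sketched as follows. It is a simplification: it ignores the first request that populates the cache and any requests that arrive after the cache has expired. The function name is illustrative.

```python
def input_cost_reduction(static_tokens: int, dynamic_tokens: int, cached_discount: float) -> float:
    """Fraction of input cost saved once the static prefix is served from cache."""
    without_cache = static_tokens + dynamic_tokens
    # Cached tokens are billed at (1 - discount) of the normal input rate.
    with_cache = dynamic_tokens + static_tokens * (1 - cached_discount)
    return 1 - with_cache / without_cache


saving_90 = input_cost_reduction(4_000, 200, cached_discount=0.90)
saving_50 = input_cost_reduction(4_000, 200, cached_discount=0.50)
print(f"90% discount: {saving_90:.1%}, 50% discount: {saving_50:.1%}")
```

With a 90 percent cached-input discount the saving works out to about 85.7 percent of input cost; with a 50 percent discount the same prompt shape saves about 47.6 percent.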

For high-volume applications, this single optimisation cuts total cost by 30-50 percent.

Optimisation 2: batch APIs (50% off for non-real-time)

Anthropic + OpenAI both offer batch APIs: submit a batch of requests, get results back within 24 hours, pay 50 percent less than the synchronous API.

When batch APIs are the right choice

Offline classification (process yesterday's customer service tickets for sentiment + categorisation)
Content generation pipelines (generate product descriptions for a new catalog batch)
Eval suite runs (test 1,000 prompts against a new model version overnight)
Data enrichment (analyse 50,000 documents over the weekend)

When batch APIs are wrong

Real-time user-facing requests (a 24-hour wait is unacceptable)
Interactive workflows where the user expects a response in seconds

Implementation note

Most teams use the batch API for the wrong things (because it's "cheap") + miss the right things (because it requires architectural rework). The architectural shift: split your workload into "synchronous, real-time" vs "asynchronous, can wait". Then route appropriately.
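For the asynchronous side of that split, a rough sketch of preparing an overnight ticket-classification batch is shown below. It assumes the Anthropic SDK's message batches endpoint (`client.messages.batches.create`), which takes a list of entries with a `custom_id` and `params`. The helper names, prompt wording and model ID are illustrative assumptions, and only the request-building step runs here.

```python
def build_batch_requests(tickets: dict[str, str], model: str, max_tokens: int = 50) -> list[dict]:
    """Turn {ticket_id: text} into entries shaped for a message batch."""
    return [
        {
            "custom_id": ticket_id,  # used to match results back to tickets
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [
                    {"role": "user", "content": f"Classify the sentiment of this ticket:\n{text}"}
                ],
            },
        }
        for ticket_id, text in tickets.items()
    ]


def submit_batch(client, requests: list[dict]):
    """Submit the batch; `client` is expected to be an anthropic.Anthropic instance."""
    return client.messages.batches.create(requests=requests)


requests = build_batch_requests(
    {"ticket-1": "My invoice is wrong.", "ticket-2": "Great support, thanks!"},
    model="claude-haiku-4-5",  # assumed model ID
)
print(len(requests), requests[0]["custom_id"])
```

Results arrive later and are matched back to tickets by `custom_id`, so the calling code needs somewhere to store pending batch IDs. That storage is the architectural rework mentioned above.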

Optimisation 3: model tiering (cheap for routing, expensive for reasoning)

The model that handles "is this customer's question about billing or product?" does not need to be the same model that handles "generate a personalised 500-word product recommendation". Routing the easy work to a cheaper model saves 70-95 percent.

The tiered architecture

USER REQUEST
   |
   v
[Haiku 4.5: classify + route] routes to one of:
   |
   +--> [Haiku 4.5: simple Q&A]                  (90% of requests)
   +--> [Sonnet 4.6: complex reasoning + tools]  (9% of requests)
   +--> [Opus 4.7: high-stakes generation]       (1% of requests)

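The split in the diagram is illustrative; the right mix depends on the workload. As a rough guide, most production agents send about 90 percent of requests to Haiku/Sonnet-class models and the remainder, 10 percent or less, to Opus or an equivalent.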

Picking the right tier per task

Task type	Right model
Classification, routing, simple Q&A	Haiku 4.5 / GPT-4o-mini
Tool-use, multi-step reasoning, draft generation	Sonnet 4.6 / GPT-4o
High-stakes generation, complex analysis, code generation	Opus 4.7 / GPT-5
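One way to encode the table is a lookup from task label to model ID, as in the sketch below. The task labels are hypothetical, and the Haiku and Opus IDs are assumed to follow the same naming pattern as the Sonnet ID used earlier in this guide.

```python
from enum import Enum


class Tier(Enum):
    CHEAP = "claude-haiku-4-5"   # assumed model ID
    MID = "claude-sonnet-4-6"
    TOP = "claude-opus-4-7"      # assumed model ID


# Hypothetical task labels, grouped as in the table above.
TASK_TIERS = {
    "classification": Tier.CHEAP,
    "routing": Tier.CHEAP,
    "simple_qa": Tier.CHEAP,
    "tool_use": Tier.MID,
    "multi_step_reasoning": Tier.MID,
    "draft_generation": Tier.MID,
    "high_stakes_generation": Tier.TOP,
    "complex_analysis": Tier.TOP,
    "code_generation": Tier.TOP,
}


def pick_model(task_type: str) -> str:
    """Return the model ID for a task label; unknown labels fall back to the mid tier."""
    return TASK_TIERS.get(task_type, Tier.MID).value


print(pick_model("routing"), pick_model("code_generation"), pick_model("unknown"))
```

Falling back to the mid tier for unrecognised labels is a judgment call: it avoids both silent quality loss on the cheap tier and accidental top-tier spend.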

Common failure mode

The common failure is defaulting to the most powerful model and paying the premium on tasks that don't need it. Opus costs roughly 15x what Haiku does, so using Opus where Haiku would do is a 15x overspend on those requests.
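To put a number on the difference, the sketch below computes a blended input price for the 90/9/1 split from the diagram, using the input prices in this guide's Claude table. It assumes every request carries the same number of tokens regardless of tier, which real traffic will not match exactly.

```python
# Input prices in Rs per 1M tokens, from the Claude table earlier in this guide.
INPUT_PRICE = {"haiku": 80, "sonnet": 250, "opus": 1250}


def blended_price(mix: dict[str, float]) -> float:
    """Weighted average price for a traffic mix whose shares sum to 1."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * INPUT_PRICE[tier] for tier, share in mix.items())


tiered = blended_price({"haiku": 0.90, "sonnet": 0.09, "opus": 0.01})
all_opus = blended_price({"opus": 1.0})
saving = 1 - tiered / all_opus
print(f"tiered: Rs {tiered:.0f}/M, all-Opus: Rs {all_opus:.0f}/M, saving: {saving:.0%}")
```

Under these assumptions the blended price is about Rs 107 per million input tokens against Rs 1,250 for all-Opus, a saving of roughly 91 percent.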

Optimisation 4: response truncation + structured output

If your application only needs the model to return a 50-word summary, capping max_tokens=200 means a misbehaving model can't blow your budget on a 4,000-word essay.

Structured output for further control

Use the provider's structured-output mode (JSON-schema-typed responses on OpenAI; tool use or the native JSON mode on Anthropic). It:

Forces the model to return parseable, fixed-shape responses
Prevents the model from padding with explanations
Reduces output tokens 30-60 percent vs free-form prose

For data extraction + classification + routing workflows, structured output is the right default.
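As one possible shape for this on Anthropic, the sketch below builds the keyword arguments for `client.messages.create` with a single forced tool, so the reply comes back as schema-shaped arguments instead of prose. The tool name, categories and model ID are hypothetical, and the function only builds the request, it does not call the API.

```python
def classification_request(ticket_text: str, model: str = "claude-haiku-4-5") -> dict:
    """Build keyword arguments for client.messages.create(**kwargs)."""
    tool = {
        "name": "record_classification",  # hypothetical tool name
        "description": "Record the category and sentiment of a support ticket.",
        "input_schema": {
            "type": "object",
            "properties": {
                "category": {"type": "string", "enum": ["billing", "product", "other"]},
                "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
            },
            "required": ["category", "sentiment"],
        },
    }
    return {
        "model": model,
        "max_tokens": 200,  # hard cap on output tokens
        "tools": [tool],
        # Forcing this tool means the reply is the tool's arguments, not free-form text.
        "tool_choice": {"type": "tool", "name": tool["name"]},
        "messages": [{"role": "user", "content": ticket_text}],
    }


kwargs = classification_request("I was charged twice this month.")
print(kwargs["tool_choice"], kwargs["max_tokens"])
```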

Optimisation 5: prompt + context engineering

Most production prompts are bloated. Trimming saves cost on every request forever.

What to trim

Verbose role descriptions. "You are an expert customer service representative who is friendly, helpful, and knowledgeable about our company's products and policies, and you always strive to provide the best possible experience..." can become "You are a customer service agent for [Company]. Answer briefly + accurately."
Redundant instructions. Saying the same thing three different ways doesn't help the model + costs tokens.
Few-shot examples that don't change behaviour. Test whether removing each example degrades quality; remove the ones that don't.
Unused tool descriptions. If the agent uses 3 of 12 declared tools, ship only the 3 (with conditional inclusion of the others based on intent classification).
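The conditional tool inclusion in the last item could look roughly like this. The tool registry, intent names and the always-included escalation tool are all hypothetical.

```python
# Hypothetical tool registry: tool name -> definition sent to the model.
TOOLS = {
    "lookup_order": {"name": "lookup_order", "description": "Fetch an order by ID."},
    "issue_refund": {"name": "issue_refund", "description": "Refund a paid order."},
    "search_docs": {"name": "search_docs", "description": "Search product documentation."},
    "create_ticket": {"name": "create_ticket", "description": "Escalate to a human agent."},
}

# Hypothetical mapping from classified intent to the tools that intent can need.
TOOLS_BY_INTENT = {
    "billing": ["lookup_order", "issue_refund"],
    "product": ["search_docs"],
}
ALWAYS_INCLUDED = ["create_ticket"]


def tools_for(intent: str) -> list[dict]:
    """Return only the tool definitions relevant to an intent, plus the always-on ones."""
    names = TOOLS_BY_INTENT.get(intent, []) + ALWAYS_INCLUDED
    return [TOOLS[name] for name in names]


print([tool["name"] for tool in tools_for("billing")])
```

One trade-off to watch: tool definitions are usually part of the cached prompt prefix, so varying the tool set per request can lower the cache hit rate. A small number of fixed tool sets per intent tends to keep both benefits.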

Context window discipline

For agents with conversation history, decide:

How many turns of history to keep (typically 5-10 most recent)
When to summarise older history into a compact summary
When to start fresh (reset context after task completion)

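Conversation history grows linearly with each turn, and because the full history is resent on every request, the cumulative input cost of a conversation grows faster than linearly. Discipline keeps it bounded.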
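A minimal sketch of the first two decisions: split the history into the recent turns to send verbatim and the older turns to summarise or drop. It assumes the history is a list of alternating user/assistant messages in completed pairs. The summarisation call itself is left out.

```python
def split_history(history: list[dict], keep_turns: int = 5) -> tuple[list[dict], list[dict]]:
    """Split alternating user/assistant messages into (older, recent).

    `recent` holds the last `keep_turns` exchanges; `older` is what could be
    summarised into a compact note or dropped.
    """
    if keep_turns < 1:
        raise ValueError("keep_turns must be at least 1")
    keep = keep_turns * 2  # one turn = a user message plus an assistant reply
    if len(history) <= keep:
        return [], list(history)
    return history[:-keep], history[-keep:]


# Twelve completed exchanges, i.e. 24 messages.
history = [
    {"role": role, "content": f"message {i}"}
    for i in range(12)
    for role in ("user", "assistant")
]
older, recent = split_history(history, keep_turns=5)
print(len(older), len(recent), recent[0]["role"])
```

Cutting on whole exchanges keeps the retained history starting with a user message, which some APIs require.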

Monitoring + cost attribution

Without monitoring you cannot improve.

The four metrics to track

Metric	How to measure
Cost per request	Sum of (input cost + output cost) per API call
Cost per user / per team / per feature	Tag every request with the upstream identifier + aggregate
Cache hit rate	Cached tokens / total input tokens
Wasted-token rate	Tokens consumed by failed or retried requests / total tokens
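The four metrics can be computed from per-request log records along these lines. The `RequestLog` fields are an assumed logging schema, not something any provider emits in this form.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RequestLog:
    feature: str             # upstream identifier used for attribution
    cost: float              # rupees for this call (input + output)
    input_tokens: int        # total input tokens, cached and fresh
    cached_tokens: int       # input tokens served from cache
    output_tokens: int
    failed_or_retried: bool


def summarise(logs: list[RequestLog]) -> dict:
    """Compute the four metrics from a list of per-request log records."""
    cost_by_feature: dict[str, float] = defaultdict(float)
    for log in logs:
        cost_by_feature[log.feature] += log.cost
    total_input = sum(log.input_tokens for log in logs)
    total_tokens = sum(log.input_tokens + log.output_tokens for log in logs)
    wasted = sum(log.input_tokens + log.output_tokens for log in logs if log.failed_or_retried)
    return {
        "cost_per_request": sum(log.cost for log in logs) / len(logs) if logs else 0.0,
        "cost_by_feature": dict(cost_by_feature),
        "cache_hit_rate": sum(log.cached_tokens for log in logs) / total_input if total_input else 0.0,
        "wasted_token_rate": wasted / total_tokens if total_tokens else 0.0,
    }


logs = [
    RequestLog("chat", 0.5, 4200, 4000, 300, False),
    RequestLog("search", 0.3, 1000, 0, 500, True),
]
metrics = summarise(logs)
print(metrics)
```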

Alerting thresholds

Daily cost above N percent of monthly budget pace
Cache hit rate below 30 percent (suggests caching isn't working)
Wasted-token rate above 10 percent (suggests retries or errors are draining budget)
Single user / single feature consuming above N percent of cost (suggests abuse or design issue)
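These thresholds translate into a simple check, sketched below. The two "N percent" values are left as required parameters because the right numbers depend on the programme. The 110 and 80 in the example call are arbitrary, and "budget pace" is interpreted here as the monthly budget spread evenly across the days of the month.

```python
def check_alerts(
    daily_cost: float,
    monthly_budget: float,
    days_in_month: int,
    cache_hit_rate: float,
    wasted_token_rate: float,
    cost_by_feature: dict[str, float],
    pace_pct: float,
    share_pct: float,
) -> list[str]:
    """Return a message for each threshold that is breached."""
    alerts = []
    daily_pace = monthly_budget / days_in_month  # even spend across the month
    if daily_cost > daily_pace * pace_pct / 100:
        alerts.append("daily cost above budget pace")
    if cache_hit_rate < 0.30:
        alerts.append("cache hit rate below 30 percent")
    if wasted_token_rate > 0.10:
        alerts.append("wasted-token rate above 10 percent")
    total = sum(cost_by_feature.values())
    for feature, cost in cost_by_feature.items():
        if total and cost / total * 100 > share_pct:
            alerts.append(f"{feature} above {share_pct:g} percent of cost")
    return alerts


alerts = check_alerts(
    daily_cost=60_000,
    monthly_budget=1_500_000,
    days_in_month=30,
    cache_hit_rate=0.20,
    wasted_token_rate=0.05,
    cost_by_feature={"chat": 90.0, "search": 10.0},
    pace_pct=110,   # arbitrary example value
    share_pct=80,   # arbitrary example value
)
print(alerts)
```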

For Indian D2C + SaaS programmes running production AI features, monthly cost reviews + quarterly architecture audits are the cadence.

What this looks like in practice

A real-world before / after for an Indian SaaS at $3M ARR running a customer-service agent:

Before optimisation:
Model: Claude Opus 4.7 for everything
No caching
No batching
12,000 conversations/month
Cost: Rs 14 lakh/month

After optimisation (6 weeks of work):
Model tiering: 70% Haiku, 25% Sonnet, 5% Opus
Prompt caching on system prompt + tool definitions (4,000 tokens cached)
Batch API for overnight ticket classification
Response truncation + structured output where applicable
12,000 conversations/month
Cost: Rs 5.2 lakh/month

Net: 63 percent cost reduction at unchanged user-facing performance.

Common mistakes

Defaulting to the most expensive model. Most production workloads don't need it.
No prompt caching. The single highest-impact optimisation in 2026.
Long verbose prompts. Tokens add up; trim ruthlessly.
No max_tokens cap. A misbehaving model can blow your budget on a single request.
No cost monitoring. You cannot optimise what you don't measure.

Production checklist

For a production LLM application running cost-effectively:

Prompt caching enabled on system prompts + tool definitions
Model tiering: cheap model for routing + simple, mid-tier for main reasoning, top-tier only for hard cases
Batch API used for asynchronous workloads
Structured output for data-extraction + classification tasks
max_tokens cap on every request
Conversation history trimming policy
Cost-per-request + cost-per-feature monitoring
Daily / weekly / monthly cost dashboards
Alerting on cost anomalies
Quarterly prompt audit + trim pass
Cache hit rate above 50 percent (target)
Wasted-token rate below 5 percent (target)

References + linked context

Dcrayons glossary: zero-data-retention, model-context-protocol, eval-suite
Dcrayons reference architectures: Production Claude on AWS Mumbai, Enterprise AI Agent Architecture, Enterprise RAG Architecture

LLM cost is the easiest line item to ignore + the most consequential to optimise. If your AI programme is burning cash on token bills, reach out via the contact form for a 30-minute review.
