Cut Your LLM Bill by 70% in 2026 — 9 Proven Tactics

A playbook from teams running millions of LLM calls a month. Real numbers, working code, zero vendor fluff.

The average early-stage AI startup burns 32% of their runway on LLM APIs. The best-run ones burn under 10%. Same product. Same quality. The difference is nine engineering habits, none of them hero work.

Here they are, ranked by biggest-bang-for-effort.

1 Route cheap models for cheap tasks

The single highest-leverage decision: stop calling GPT-5.5 for everything. Most pipelines have 3-5 distinct jobs, each with different quality requirements. Route each job to the cheapest model that clears your eval bar, and reserve the frontier model for the few jobs that actually need it.

Typical saving: 60-80% of the bill. Effort: 1-2 days of routing logic + evals.
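A minimal router sketch, assuming three task types and the cheapest model that passed evals for each; the task labels and model ID strings are placeholders, not official API names:

# Hypothetical task router: send each job to the cheapest model that clears your eval bar
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint

# Assumed mapping; swap in whatever your own evals approve
MODEL_FOR_TASK = {
    "classify": "deepseek-v4",         # short labels, huge volume
    "extract": "gemini-flash",         # structured output, mid difficulty
    "draft_reply": "claude-opus-4-7",  # user-facing prose, keep the frontier model
}

def complete(task: str, prompt: str) -> str:
    model = MODEL_FOR_TASK.get(task, "claude-opus-4-7")  # unknown task: default to the safe choice
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content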

2 Turn on prompt caching

If you have a long system prompt (1K+ tokens) that's sent on every request, you're paying for it every time. All four major providers now support caching:

| Provider | Cache read discount | TTL |
| --- | --- | --- |
| OpenAI (GPT-5.5) | ~87% off | ~5-10 min |
| Anthropic (Opus 4.7) | 90% off | 1 hour |
| Google (Gemini 3.1 Pro) | 75% off | 1+ hour |
| DeepSeek V4 | 75% off | 1 hour |

Typical saving: 30-50% on high-traffic apps. Effort: ~1 hour (just add cache_control markers).
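With the Anthropic SDK, for example, caching is a cache_control marker on the long static system block; a minimal sketch (the model ID string is a placeholder):

# Anthropic prompt caching: mark the long, static system prompt as cacheable
import anthropic

client = anthropic.Anthropic()
LONG_SYSTEM_PROMPT = "..."  # your 1K+ token system prompt, identical on every request

resp = client.messages.create(
    model="claude-opus-4-7",  # placeholder ID; use your actual model string
    max_tokens=512,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # later reads of this prefix get the discount
    }],
    messages=[{"role": "user", "content": "user question here"}],
)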

3 Measure before you cut

You can't optimize what you can't see. Before touching prod:

  1. Log (model, prompt_tokens, completion_tokens, cost) per request to a columnar store (ClickHouse, DuckDB, BigQuery).
  2. Build a daily dashboard: cost by endpoint, by model, by customer.
  3. Use TokenScope to sanity-check your actual prompt lengths before switching models.

90% of cost blowouts are one runaway endpoint nobody noticed. Find it first.
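A minimal version of step 1, using DuckDB as the columnar store; the table layout and per-token prices here are assumptions, so substitute your own rate card:

# Per-request cost logging into DuckDB
import duckdb

con = duckdb.connect("llm_costs.duckdb")
con.execute("""
    CREATE TABLE IF NOT EXISTS llm_calls (
        ts TIMESTAMP, endpoint TEXT, model TEXT,
        prompt_tokens INTEGER, completion_tokens INTEGER, cost_usd DOUBLE
    )
""")

# $ per 1M tokens (input, output); illustrative numbers, not a real rate card
PRICES = {"gpt-5.5": (5.0, 25.0), "deepseek-v4": (0.3, 1.2)}

def log_call(endpoint: str, model: str, usage) -> None:
    in_price, out_price = PRICES[model]
    cost = (usage.prompt_tokens * in_price + usage.completion_tokens * out_price) / 1e6
    con.execute(
        "INSERT INTO llm_calls VALUES (now(), ?, ?, ?, ?, ?)",
        [endpoint, model, usage.prompt_tokens, usage.completion_tokens, cost],
    )

# Daily dashboard query: cost by endpoint and model
# con.sql("SELECT endpoint, model, SUM(cost_usd) FROM llm_calls GROUP BY 1, 2").show()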

🔎 Count tokens across all models, free

TokenScope: paste prompt, see cost in GPT-5.5, Claude, DeepSeek, Gemini side-by-side.

Open TokenScope →

4 Trim your system prompt

System prompts balloon over time — every hotfix adds a rule. After 6 months many teams ship 4K-token system prompts where 800 would do.

Quick process:

  1. Dump the current prompt into Gemini 3.1 Pro with its 2M context.
  2. Ask: "Compress this to 1/3 length while preserving all behavioral constraints. Output the shorter version."
  3. A/B test on 200 real prompts. If eval score drops <2%, ship it.

Typical saving: 10-25% of input token spend. Effort: 2-4 hours.
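A sketch of step 3; generate and score are stand-ins for whatever model call and eval metric you already run:

# A/B the trimmed system prompt on a sample of real traffic
import random

def ab_test(old_system, new_system, real_prompts, generate, score, n=200):
    """generate(system, prompt) -> str and score(prompt, output) -> float are your own hooks."""
    sample = random.sample(real_prompts, min(n, len(real_prompts)))
    old_avg = sum(score(p, generate(old_system, p)) for p in sample) / len(sample)
    new_avg = sum(score(p, generate(new_system, p)) for p in sample) / len(sample)
    drop_pct = (old_avg - new_avg) / old_avg * 100
    return drop_pct < 2.0, drop_pct  # ship the shorter prompt if the eval drop is under 2%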

5 Use batch mode for non-realtime jobs

OpenAI, Anthropic, and Google all offer batch APIs at a 50% discount with a 24-hour SLA. If your job can tolerate async completion (nightly reports, document ingestion, training-data generation, classification backfills), batch is free money.

# OpenAI batch example — 50% off
from openai import OpenAI

client = OpenAI()
batch = client.batches.create(
    input_file_id="file-abc",        # JSONL of requests, uploaded beforehand via client.files.create
    endpoint="/v1/chat/completions",
    completion_window="24h",         # results arrive within the 24-hour SLA
)
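The input file is just a JSONL of individual requests, uploaded ahead of the batch; a sketch (the prompt list and custom_id scheme are up to you):

# Build and upload the batch input file: one JSON object per request
import json
from openai import OpenAI

client = OpenAI()
queued_prompts = ["summarize ticket 101", "summarize ticket 102"]  # your non-realtime backlog

with open("batch_input.jsonl", "w") as f:
    for i, prompt in enumerate(queued_prompts):
        f.write(json.dumps({
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": "gpt-5.5", "messages": [{"role": "user", "content": prompt}]},
        }) + "\n")

input_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
# pass input_file.id as input_file_id in the batches.create call above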

Typical saving: 50% on batchable workloads. Effort: ~4 hours to add a queue.

6 Cap max_tokens aggressively

Output tokens are 3-6x more expensive than input. Yet most developers set max_tokens=4096 "just in case." Real responses are usually 50-500 tokens.

Set per-endpoint limits based on the p95 of actual output length; the cap also cuts off runaway responses before they get long. For GPT-5.5, every million output tokens you don't generate is $25 you don't pay.
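A sketch of deriving those caps from the log table built in tactic 3 (the table and column names assume that schema):

# Derive per-endpoint max_tokens caps from the p95 of observed output lengths
import duckdb

con = duckdb.connect("llm_costs.duckdb")
rows = con.sql("""
    SELECT endpoint, QUANTILE_CONT(completion_tokens, 0.95) AS p95_out
    FROM llm_calls
    GROUP BY endpoint
""").fetchall()

# Add a little headroom so the cap rarely truncates a legitimate answer
MAX_TOKENS = {endpoint: int(p95 * 1.2) for endpoint, p95 in rows}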

7 Use a cheaper gateway

If you're paying OpenAI / Anthropic directly from Asia, Africa, or Latin America, you're losing 3-5% to FX and card fees and another 5-10% to latency-driven retries, and you often can't pay with local payment methods.

An OpenAI-compatible gateway with Hong Kong servers (like NovAI) fixes this, and because the API surface is identical the switch is usually a one-line change.
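A sketch of that switch; the base URL below is a placeholder, not NovAI's actual endpoint:

# Point the OpenAI SDK at an OpenAI-compatible gateway; only base_url and api_key change
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.example.com/v1",  # placeholder: use your gateway's real endpoint
    api_key=os.environ["GATEWAY_API_KEY"],
)

resp = client.chat.completions.create(
    model="deepseek-v4",  # whatever model IDs the gateway exposes
    messages=[{"role": "user", "content": "ping"}],
)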

8 Add an LLM-cache layer for identical prompts

If your product sees repeated queries (FAQ chatbot, docs assistant, translation of common strings), a semantic cache (Redis + embedding similarity) can reach a 30-60% hit rate. Response time drops to sub-10ms and API cost drops to zero on cached hits.

Tools: gptcache, langchain-cache, or a 50-line custom hit-check on a sentence embedding + TTL.
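A minimal in-process version of the custom option; the embedding model, the in-memory list standing in for Redis, and the 0.95 similarity threshold are all assumptions to tune against your own traffic:

# Semantic cache: reuse an answer when a new prompt is close enough to a cached one
import time
import numpy as np
from openai import OpenAI

client = OpenAI()
_cache = []  # (embedding, answer, expires_at) tuples; swap for Redis in production

def _embed(text: str) -> np.ndarray:
    v = client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding
    v = np.array(v)
    return v / np.linalg.norm(v)

def cached_answer(prompt: str, ttl: int = 3600, threshold: float = 0.95) -> str:
    q = _embed(prompt)
    now = time.time()
    for emb, answer, expires in _cache:
        if expires > now and float(q @ emb) >= threshold:
            return answer  # cache hit: no completion call, no API cost
    answer = client.chat.completions.create(
        model="deepseek-v4",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    _cache.append((q, answer, now + ttl))
    return answer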

9 Escalate only when confidence is low

Pattern: run the cheap model first (DeepSeek V4 / Gemini Flash). If the output is low-confidence (the model hedges with "I'm not sure", the JSON fails to parse, or logprobs are low), re-run it on Opus 4.7.

# Cheap-first cascade: call, .confidence, and not_critical are stand-ins for your own
# client wrapper, confidence signal (logprobs, parse success), and criticality check
def answer(prompt):
    cheap = call("deepseek-v4", prompt)
    if cheap.confidence > 0.9 or not_critical(prompt):
        return cheap  # high confidence or low stakes: keep the cheap answer
    # escalate ~5% of calls to the frontier model
    return call("claude-opus-4-7", prompt)

Typical savings: 70-90% vs going frontier-only, with quality within 1-2% on evals.

One Gateway for All 9 Tactics

NovAI gives you routing, caching, USDT billing, and 30+ models in one OpenAI-compatible API. $0.50 free credit.

Get Free API Key →

A real-world case study

A customer support SaaS running 3M LLM calls/month, baselined on GPT-5.5:

| Config | Monthly cost | Savings |
| --- | --- | --- |
| Baseline (100% GPT-5.5) | $18,400 | |
| + prompt caching | $11,040 | −40% |
| + route classification → DeepSeek V3.2 | $6,260 | −66% |
| + semantic cache (30% hit) | $4,380 | −76% |
| + batch non-realtime jobs | $3,480 | −81% |

$18.4K → $3.5K. Same product. Eval scores within 1.2% of baseline on their golden test set. Implementation time: three weeks for one engineer.

What not to do

TL;DR checklist

  1. ✅ Log every call's (model, tokens, cost) — 1 day
  2. ✅ Turn on prompt caching — 1 hour
  3. ✅ Cap max_tokens to p95 of real responses — 1 hour
  4. ✅ Route non-frontier tasks to DeepSeek V4 / Gemini Flash — 1-2 days
  5. ✅ Add batch mode for async jobs — 1 day
  6. ✅ Add semantic cache for repeated queries — 1 day
  7. ✅ Trim system prompt by 50-70% — 4 hours
  8. ✅ Add escalation (cheap model first, frontier on low confidence) — 2 days
  9. ✅ Switch to an OpenAI-compatible gateway for better prices/latency — 30 min

Expected total: 60-80% cost cut. Worst-case quality regression: 1-3%. Best investment of engineering time in 2026.

Related Articles

TokenScope: Free Token Counter → DeepSeek V4 vs GPT-5.5 → Gemini 3.1 Pro API Guide → AI API Pricing Comparison →
Cut your LLM bill 70% with NovAI's multi-model gateway. Get Free API Key →