The average early-stage AI startup burns 32% of its runway on LLM APIs. The best-run ones burn under 10%. Same product. Same quality. The difference is nine engineering habits, none of them hero work.
Here they are, ranked by biggest-bang-for-effort.
The single highest-leverage decision: stop calling GPT-5.5 for everything. Most pipelines have 3-5 distinct jobs, each with different quality requirements, and usually only one or two of them actually need a frontier model.
Typical saving: 60-80% of the bill. Effort: 1-2 days of routing logic plus evals.
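In practice the routing is often just a dictionary from job type to model ID behind one OpenAI-compatible client. A minimal sketch, where the job names and the cheap-model IDs are illustrative placeholders, not a prescription:

```python
from openai import OpenAI

client = OpenAI()  # or any OpenAI-compatible gateway endpoint

# Hypothetical routing table: map each pipeline job to the cheapest
# model that still passes your evals for that job.
ROUTES = {
    "classify_ticket": "deepseek-v4",
    "extract_fields": "gemini-3.1-flash",
    "draft_reply": "gpt-5.5-mini",
    "legal_review": "claude-opus-4-7",
}

def route(job: str, prompt: str) -> str:
    model = ROUTES.get(job, "gpt-5.5")  # unknown jobs default to frontier
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```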
If you have a long system prompt (1K+ tokens) that's sent on every request, you're paying for it every time. All four major providers now support caching:
| Provider | Cache read discount | TTL |
|---|---|---|
| OpenAI (GPT-5.5) | ~87% off | ~5-10 min |
| Anthropic (Opus 4.7) | 90% off | 1 hour |
| Google (Gemini 3.1 Pro) | 75% off | 1+ hour |
| DeepSeek V4 | 75% off | 1 hour |
Typical saving: 30-50% on high-traffic apps. Effort: ~1 hour (just add cache_control markers).
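On Anthropic, for example, the marker looks roughly like this. A sketch: the prompt file path and user message are placeholders, and the model ID is taken from the table above:

```python
import anthropic

client = anthropic.Anthropic()

LONG_SYSTEM_PROMPT = open("system_prompt.txt").read()  # your 1K+ token prompt

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        # Everything up to this marker is cached; subsequent requests
        # re-read it at the discounted rate.
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)
```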
You can't optimize what you can't see. Before touching prod, log (model, prompt_tokens, completion_tokens, cost) per request to a columnar store (ClickHouse, DuckDB, BigQuery). 90% of cost blowouts are one runaway endpoint nobody noticed. Find it first.
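A minimal sketch of that log using DuckDB; the schema and the `log_request` helper are illustrative, not a fixed design:

```python
import duckdb

con = duckdb.connect("llm_usage.db")
con.execute("""
    CREATE TABLE IF NOT EXISTS requests (
        ts TIMESTAMP DEFAULT current_timestamp,
        endpoint TEXT, model TEXT,
        prompt_tokens INT, completion_tokens INT, cost DOUBLE
    )
""")

def log_request(endpoint, model, usage, cost):
    # `usage` is the usage object any OpenAI-compatible API returns
    con.execute(
        "INSERT INTO requests (endpoint, model, prompt_tokens, completion_tokens, cost) "
        "VALUES (?, ?, ?, ?, ?)",
        [endpoint, model, usage.prompt_tokens, usage.completion_tokens, cost],
    )
```

Finding the runaway endpoint is then one query: `SELECT endpoint, SUM(cost) FROM requests GROUP BY endpoint ORDER BY 2 DESC;`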
TokenScope: paste a prompt, see its cost on GPT-5.5, Claude, DeepSeek, and Gemini side by side.
Open TokenScope →

System prompts balloon over time — every hotfix adds a rule. After six months, many teams ship 4K-token system prompts where 800 would do.
Quick process: dump the live prompt, delete rules that no longer trigger, merge duplicates, then re-run your evals to confirm nothing regressed.
Typical saving: 10-25% of input token spend. Effort: 2-4 hours.
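Measuring is the easy part. A quick count with tiktoken, using cl100k_base as a stand-in tokenizer (swap in whichever encoding matches your actual model):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # approximation; match your model
prompt = open("system_prompt.txt").read()
print(len(enc.encode(prompt)), "tokens")
```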
OpenAI, Anthropic, and Google all offer batch APIs at a 50% discount with a 24-hour SLA. If your job can tolerate async completion (nightly reports, document ingestion, training-data generation, classification backfills), batch is free money.
```python
# OpenAI batch example — 50% off
from openai import OpenAI

client = OpenAI()

# Upload a JSONL file of requests first (purpose="batch")
batch_file = client.files.create(
    file=open("requests.jsonl", "rb"), purpose="batch"
)

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # the 24-hour SLA that buys the discount
)
```
Typical saving: 50% on batchable workloads. Effort: ~4 hours to add a queue.
Cap max_tokens aggressively. Output tokens are 3-6x more expensive than input, yet most developers set max_tokens=4096 "just in case" when real responses are usually 50-500 tokens.
Set per-endpoint limits based on the p95 of actual output length to cap hallucinated long responses, as sketched below. At GPT-5.5's $25 per 1M output tokens, that's money you'd otherwise burn.
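A sketch of deriving that limit, assuming the DuckDB usage log from the observability step; the endpoint name and the 20% headroom factor are illustrative choices:

```python
import duckdb
import numpy as np

con = duckdb.connect("llm_usage.db")  # the usage log built earlier
lengths = [r[0] for r in con.execute(
    "SELECT completion_tokens FROM requests WHERE endpoint = ?",
    ["support_reply"],  # hypothetical endpoint name
).fetchall()]

# p95 of real outputs plus ~20% headroom, instead of a blanket 4096
max_tokens = int(np.percentile(lengths, 95) * 1.2)
```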
If you're paying OpenAI or Anthropic directly from Asia, Africa, or Latin America, you're losing 3-5% to FX and card fees, another 5-10% to latency-driven retries, and you can't pay through local methods.
An OpenAI-compatible gateway with Hong Kong servers (like NovAI) fixes all three: closer servers cut retry-inducing latency, billing can run on local rails such as USDT, and your code keeps using the same SDK.
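Because the gateway speaks the OpenAI protocol, the switch is a one-line config change. The base URL below is a placeholder, not a real endpoint:

```python
from openai import OpenAI

# Point the standard SDK at an OpenAI-compatible gateway instead of
# api.openai.com. Base URL and key here are placeholders.
client = OpenAI(
    base_url="https://your-gateway.example.com/v1",
    api_key="YOUR_GATEWAY_KEY",
)
# Everything else in your codebase stays unchanged.
```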
If your product sees repeated queries (FAQ chatbot, docs assistant, translation of common strings), a semantic cache (Redis + embedding similarity) can hit a 30-60% cache rate. Response time drops to sub-10ms and API cost drops to zero on cached hits.
Tools: gptcache, langchain-cache, or a ~50-line custom hit-check on sentence-embedding similarity plus a TTL, as sketched below.
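The custom version really is small. A minimal in-process sketch: the similarity threshold, TTL, and in-memory list are illustrative, and production versions would keep the vectors in Redis:

```python
import time
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
_cache = []  # (embedding, response, expires_at); use Redis in prod

TTL_SECONDS = 3600
SIM_THRESHOLD = 0.92  # tune on your own query distribution

def cached_answer(prompt: str, llm_call):
    vec = embedder.encode(prompt, normalize_embeddings=True)
    now = time.time()
    for emb, resp, expires in _cache:
        # cosine similarity of normalized vectors is just a dot product
        if expires > now and float(np.dot(vec, emb)) >= SIM_THRESHOLD:
            return resp  # cache hit: zero API cost
    resp = llm_call(prompt)  # cache miss: pay for one real call
    _cache.append((vec, resp, now + TTL_SECONDS))
    return resp
```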
Pattern: run the cheap model first (DeepSeek V4 / Gemini Flash). If the output is low-confidence (model says "I'm not sure" / JSON parse fails / logprob low), re-run on Opus 4.7.
```python
def answer(prompt):
    # `call` and `not_critical` are your own helpers: `call` wraps the
    # provider client and attaches a confidence signal (logprobs, JSON
    # validity, an "I'm not sure" check); `not_critical` flags prompts
    # where the cheap answer is always acceptable.
    cheap = call("deepseek-v4", prompt)
    if cheap.confidence > 0.9 or not_critical(prompt):
        return cheap
    # escalate ~5% of calls to the frontier model
    return call("claude-opus-4-7", prompt)
```
Typical savings: 70-90% vs going frontier-only, with quality within 1-2% on evals.
NovAI gives you routing, caching, USDT billing, and 30+ models in one OpenAI-compatible API. $0.50 free credit.
Get Free API Key →

A customer support SaaS, 3M LLM calls/month, GPT-5.5 baseline:
| Config | Monthly cost | Savings |
|---|---|---|
| Baseline (100% GPT-5.5) | $18,400 | — |
| + prompt caching | $11,040 | −40% |
| + route classification → DeepSeek V3.2 | $6,260 | −66% |
| + semantic cache (30% hit) | $4,380 | −76% |
| + batch non-realtime jobs | $3,480 | −81% |
$18.4K → $3.5K. Same product. Eval scores within 1.2% of baseline on their golden test set. Implementation time: three weeks for one engineer.
Set max_tokens to the p95 of real responses — 1 hour.

Expected total: 60-80% cost cut. Worst-case quality regression: 1-3%. Best investment of engineering time in 2026.