How to reduce LLM inference latency and token costs?
The pain everyone eventually hits but nobody budgets for: LLM unit economics. Learn how to reduce costs and latency without gutting quality.
You launch a prototype, everyone loves it, usage climbs… and suddenly latency blows up, token bills look like a Series A round, and product wants "just one more feature" that adds 3 more model calls per request. This post is about how to reduce LLM inference latency and token costs without gutting quality.
1. The Unit Economics Problem (You Can't Ignore This)
LLM apps have brutal cost dynamics:
- You pay per token, both in and out
- Latency grows with tokens, model size, and call count
- Most teams design UX and pipelines like tokens are free
It's fine when you're at 1,000 requests/day. At 1,000,000+ requests/day, every extra 500 input tokens and every unnecessary model call is real money and real user pain.
So you need to think like this:
For this user interaction, how many model calls, at what size, with how many tokens, at what latency — and what revenue or value is attached?
That's unit economics.
2. Where Your Latency and Costs Actually Come From
Break it down per request:
- Number of model calls – Single call vs multi-step agents vs chain-of-thought prompts
- Model choice – Big, slow, expensive vs small, fast, cheap
- Token volume – Prompt bloat (huge system prompts, examples, context), over-long responses
- Infrastructure overhead – Cold starts, network hops, crappy batching, underutilized GPUs/TPUs
- Extra stuff – Reranking calls, LLM-as-a-judge calls, secondary tools (classifiers, extractors)
Your job is to attack each dimension without killing quality.
3. First Lever: Stop Wasting Tokens
Before you touch quantization or fancy routing, do the obvious thing: send less junk.
3.1 Clean up prompts
- Strip boilerplate you mindlessly copy-pasted from prompt-engineering Reddit threads
- Shorten system instructions to what actually matters
- Turn multi-paragraph tone guides into a single clear sentence
Bad: "You are a super-intelligent AI system that always does X, Y, Z, writes like Hemingway, cares deeply about empathy, blah blah..."
Better: "Answer concisely, in 3–5 bullet points, using a direct and professional tone."
3.2 Control context expansion
RAG is amazing — and also a token bomb if you're lazy.
- Limit the number of retrieved chunks
- Use smaller chunk sizes with smart overlap
- Rerank and drop marginally relevant chunks instead of dumping 20 docs into context
- Use query classification: If the question is simple or generic, skip retrieval entirely. If it clearly needs docs, then retrieve.
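To make the query-classification idea in the last bullet concrete, here is a minimal sketch. The heuristics are deliberately crude, and the `retriever` and `llm` objects are hypothetical stand-ins for your own clients; in practice you'd likely swap the rules for a small classifier.

```python
# A rough sketch of a retrieval gate. The heuristics, GENERIC_PATTERNS,
# and the `retriever` / `llm` objects are illustrative placeholders.
GENERIC_PATTERNS = ("hello", "thanks", "what can you do", "who are you")

def needs_retrieval(query: str) -> bool:
    """Decide whether this query needs document grounding at all."""
    q = query.strip().lower()
    if len(q.split()) < 4:                      # short chit-chat: skip retrieval
        return False
    if any(p in q for p in GENERIC_PATTERNS):   # generic queries: skip retrieval
        return False
    return True                                 # everything else goes through RAG

def answer(query: str, retriever, llm) -> str:
    context = retriever.search(query, top_k=5) if needs_retrieval(query) else []
    return llm.generate(query=query, context=context)
```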
3.3 Control output length
You don't need 10 paragraphs when 5 bullet points will do.
- Be explicit: "Answer in ≤150 words" or "No more than 5 bullet points."
- For APIs: "Return only valid JSON, no explanation text."
Fewer output tokens → lower cost and (usually) lower latency.
4. Model Routing: Use the Cheap Stuff First
This is the biggest structural win: not every request needs your largest, fanciest model.
4.1 Tiered model strategy
Set up at least two tiers:
- Tier 1: small / cheap model (e.g., GPT-4o-mini, distilled model) – Use for: simple classification, short queries, low-risk tasks, things where occasional minor errors are acceptable
- Tier 2: large / expensive model – Use for: complex reasoning, ambiguous high-value user queries, anything user-facing where quality is critical
4.2 How to route
You can:
- Use rules-based routing: If prompt length < X and task = "simple classification" → small model. If context length > Y or multi-step reasoning required → big model.
- Or LLM-as-router: Cheap model (or special router head) looks at the request and decides if it's "easy" or "hard". Only send "hard" to the expensive model.
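Here's a minimal sketch of the LLM-as-router variant, assuming the OpenAI Python SDK (v1.x); the model names and the routing prompt are placeholders, not recommendations.

```python
# A minimal LLM-as-router sketch, assuming the OpenAI Python SDK (v1.x).
# Model names and the routing prompt are placeholders.
from openai import OpenAI

client = OpenAI()

ROUTER_PROMPT = (
    "Classify the difficulty of the user request as EASY or HARD.\n"
    "EASY: short factual lookups, simple classification, formatting tasks.\n"
    "HARD: multi-step reasoning, ambiguous or high-stakes questions.\n"
    "Reply with exactly one word: EASY or HARD."
)

def pick_model(user_query: str) -> str:
    decision = client.chat.completions.create(
        model="gpt-4o-mini",            # cheap model acting as the router
        messages=[
            {"role": "system", "content": ROUTER_PROMPT},
            {"role": "user", "content": user_query},
        ],
        max_tokens=3,
        temperature=0,
    )
    label = (decision.choices[0].message.content or "").strip().upper()
    # Only clearly HARD requests hit the expensive model.
    return "gpt-4o" if label == "HARD" else "gpt-4o-mini"
```

Note that the router itself costs a call; it only pays off when the cheap-tier savings outweigh that overhead, which is why rules-based routing is usually the first step.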
4.3 Optimization loop
Track: % of traffic going to each tier, quality metrics by tier (user feedback, eval scores), cost/latency per tier.
Goal: max traffic on the cheap tier without breaking quality thresholds.
5. Semantic Caching: Don't Pay Twice for the Same Work
Semantic caching is underused and criminally effective.
5.1 What it is
Instead of just caching exact prompts, you:
- Compute an embedding for the user query
- Look up similar past queries in a vector cache
- If you find one above a similarity threshold: reuse the previous answer (or lightly adapt it), skip the full model call
This helps for: repeated FAQs, very similar support questions, recurrent internal queries ("What's our PTO policy?" ×100).
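A minimal in-memory sketch of that lookup is below, assuming an `embed()` callable that returns a NumPy vector; the 0.92 threshold and the flat list store are illustrative, and in production you'd back this with a vector DB.

```python
# A minimal in-memory semantic cache. `embed` is any callable returning a
# NumPy vector; the 0.92 threshold and the flat list store are illustrative.
import numpy as np

class SemanticCache:
    def __init__(self, embed, threshold: float = 0.92):
        self.embed = embed
        self.threshold = threshold
        self.entries = []                        # (unit_embedding, answer, metadata)

    def _unit(self, text: str) -> np.ndarray:
        v = np.asarray(self.embed(text), dtype=np.float32)
        return v / np.linalg.norm(v)

    def lookup(self, query: str):
        q = self._unit(query)
        for emb, answer, meta in self.entries:
            if float(np.dot(q, emb)) >= self.threshold:
                return answer, meta              # cache hit: skip the model call
        return None                              # miss: caller does the full call

    def store(self, query: str, answer: str, meta: dict):
        self.entries.append((self._unit(query), answer, meta))
```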
5.2 How to do it safely
- Set a similarity threshold that's strict enough not to mis-answer edge cases
- Store: prompt, context used, final answer, any metadata (time, doc version, etc.)
- Invalidate cache or lower trust when: docs are updated, policy versions change, the answer depends on rapidly changing data
5.3 ROI
Semantic caching reduces token costs (no or fewer new calls) and latency (cache hit is near-instant). It's especially powerful at scale: as usage grows, cache hit rates improve.
6. Quantization: Squeezing More Out of Your Hardware
If you're hosting models yourself (or using open models), quantization is a huge lever for latency and cost.
6.1 What quantization does
- Converts model weights from higher precision (e.g., fp16) to lower precision (e.g., int8, int4)
- This: shrinks model size, improves memory bandwidth, often improves throughput and reduces latency
You keep most of the performance while getting more inferences per GPU.
6.2 AWQ, GPTQ, etc. (high level)
You don't need a PhD here; just know:
- GPTQ: Post-training quantization method, often used for 4-bit quant of LLMs
- AWQ: Activation-aware weight quantization, tends to preserve quality better for some models
- There are also: QLoRA-style training with quantized base, other int4/int8 schemes
Your choice depends on: model architecture, hardware target, tolerance for small quality drop vs speed gain.
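For self-hosted open models, one common way to get 4-bit weights is the transformers + bitsandbytes stack; the sketch below assumes those packages, a CUDA GPU, and a placeholder model ID.

```python
# A sketch of loading an open model in 4-bit via transformers + bitsandbytes.
# Assumes a CUDA GPU; the model ID is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",                  # 4-bit NormalFloat weights
    bnb_4bit_compute_dtype=torch.bfloat16,      # compute in bf16 for quality
)

model_id = "meta-llama/Llama-3.1-8B-Instruct"   # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
# Rough intuition: 8B params at fp16 is ~16 GB of weights; at 4-bit it's ~4-5 GB,
# which is the headroom that lets you serve more concurrent requests per GPU.
```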
6.3 Why this matters for unit economics
Quantization gives you: more concurrent requests per GPU, lower per-request latency, lower infra cost (fewer or smaller machines).
If you're at scale and not using quantization for self-hosted models, you're burning money for fun.
7. Reduce Round-Trips: Flatten the Orchestration
A lot of "agentic" systems die on unit economics because they do this:
- Call LLM to decide what to do
- Call tool
- Call LLM to interpret tool
- Call another tool
- Call LLM again for final answer
That's 3–5+ LLM calls per user query.
7.1 Strategies to reduce call count
- Combine steps: Use a single call to both decide and answer, when safe
- Pre-plan: For certain flows (e.g., known form filling), design a fixed sequence instead of open-ended agents
- Use cheaper models for planning, expensive model only for final user-facing text
- Use non-LLM logic where you can: simple conditionals, heuristics, classic classifiers
7.2 Measure and cap
For each endpoint, define:
- Max allowed number of LLM calls per request
- Target and absolute max latency
- Target and absolute max token budget
If an agent wants to go beyond that, fail gracefully or return partial results instead of spinning out.
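One way to enforce those caps is a small budget object threaded through the agent loop. A sketch is below; the limits and the `call_llm` helper (assumed to return `.text` and `.tokens_used`) are illustrative.

```python
# A sketch of per-request budgets. The limits and the `call_llm` helper
# (which returns .text and .tokens_used) are illustrative assumptions.
import time

class BudgetExceeded(Exception):
    pass

class RequestBudget:
    def __init__(self, max_llm_calls=3, max_tokens=4000, max_seconds=8.0):
        self.max_llm_calls = max_llm_calls
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self.calls, self.tokens = 0, 0
        self.start = time.monotonic()

    def charge(self, tokens_used: int):
        self.calls += 1
        self.tokens += tokens_used
        over = (self.calls > self.max_llm_calls
                or self.tokens > self.max_tokens
                or time.monotonic() - self.start > self.max_seconds)
        if over:
            raise BudgetExceeded("request exceeded its LLM budget")

def run_agent(query: str, call_llm, budget: RequestBudget) -> str:
    try:
        plan = call_llm(f"Plan the steps needed to answer: {query}")
        budget.charge(plan.tokens_used)
        final = call_llm(f"Answer using this plan:\n{plan.text}\n\nQuestion: {query}")
        budget.charge(final.tokens_used)
        return final.text
    except BudgetExceeded:
        # Fail gracefully instead of spinning out.
        return "I couldn't finish this request within its budget. Here's a partial answer."
```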
8. Infra Tuning: Batching, Streaming, and Deployment Details
Once you've done routing, caching, quantization, and token dieting, infra tuning is the last big lever.
8.1 Batching
If you control the serving stack:
- Batch multiple requests per forward pass where latency budget allows
- Great for: background jobs, LLM-as-a-judge evaluations, non-interactive workloads
8.2 Streaming responses
Streaming doesn't reduce total generation time, but it improves perceived latency:
- User sees the first tokens in 200–500 ms, even if full generation takes 2–3 seconds
- Also lets you: cut off long generations early if the user abandons the request, enforce max tokens dynamically
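A minimal streaming sketch, assuming the OpenAI Python SDK (v1.x); `user_disconnected` stands in for whatever your web framework provides to detect an abandoned request, and the model name is a placeholder.

```python
# A minimal streaming sketch, assuming the OpenAI Python SDK (v1.x).
# `user_disconnected` is a placeholder for your framework's disconnect check.
from openai import OpenAI

client = OpenAI()

def stream_answer(prompt: str, user_disconnected) -> str:
    stream = client.chat.completions.create(
        model="gpt-4o-mini",                    # placeholder model
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
        stream=True,
    )
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        parts.append(delta)
        print(delta, end="", flush=True)        # first tokens reach the user immediately
        if user_disconnected():
            break                               # stop paying for an abandoned request
    return "".join(parts)
```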
8.3 Deployment details that matter
- Keep models hot (avoid cold starts)
- Put inference endpoints close to your users or latency-sensitive services
- Monitor: GPU utilization, queue times, per-request breakdown (network vs compute)
This is standard ML infra hygiene, but it matters more when every 100ms and every cent per request scales with usage.
9. Putting It All Together: A Practical Playbook
Here's a pragmatic order of operations to fix your unit economics.
- Measure the baseline – For each endpoint: avg/max latency, avg tokens in/out, avg calls per request, cost per 1K requests
- Cut obvious waste – Shorten prompts, trim context, constrain answer length
- Add model routing – Define "easy" vs "hard" queries, route easy traffic to a cheaper/smaller model
- Add semantic caching – Cache common queries with their context + answers, track hit rate and savings
- Quantize (if self-hosting) – Move from fp16 → int8/int4 where quality allows, re-measure latency and throughput
- Reduce orchestration hops – Merge LLM calls where reasonable, replace LLM logic with classic code where you can
- Tune infra – Batching for non-interactive workloads, streaming for interactive, fix obvious deployment inefficiencies
Re-run the numbers and calculate savings per 1K/100K/1M requests. That's your real unit economics win.
10. The mindset: treat tokens like money and latency like churn
If you're serious about AI/ML ops:
- Tokens are not an abstraction — they're direct cost.
- Latency is not just "performance" — it's user experience and conversion.
- Model choice, routing, caching, and quantization are financial levers, not just fun engineering toys.
You don't have to do everything at once, but you can't pretend this doesn't matter once your app sees real traffic.
11. Practical Architecture: How to Actually Build This
Let's turn the theory into something concrete you can implement.
11.1 High-Level Architecture Overview
Think of your LLM stack as three layers:
- Edge / API layer – Receives user requests, handles auth/rate limiting/validation, talks to the "Brain" service
- Brain (Orchestration) layer – Request classifier & router, semantic cache, RAG retrieval (optional), calls model backends, applies post-processing
- Model & Data layer – Cheap model backend (GPT-4o-mini / quantized small model), expensive model backend (larger model), vector DB / search index (for RAG), metrics + logging store
11.2 Request Flow: Step-by-Step
Step 1: API receives request
Input: user_id, text, task_type, and optional metadata (tenant, language, flags). Quick checks: auth, basic input length, traffic sampling flags. Then forward to Brain service.
Step 2: Lightweight classification & routing decision
First thing in Brain:
- Task classification (cheap model or rules): task type (qa, summarize, classify, code, etc.), complexity score (simple vs complex), risk level (low vs high)
- Routing decision: easy + low-risk → cheap model path; complex or high-risk → expensive model path
This can be a simple rules engine, or a tiny router model (logistic regression, small LLM, or fine-tuned classifier).
Step 3: Semantic cache lookup
Before you spend tokens, check cache:
- Compute an embedding for the entire user query (for free-form Q&A), and/or a normalized key (e.g., "faq:refund_policy")
- Hit vector cache: If semantic similarity > threshold → cache hit (return cached answer); if no hit → continue
Cache entry stores: query_embedding, normalized_query, user_query_example, answer, source_docs, doc_version, created_at, metadata (language, tenant_id, model_used).
Invalidate cache when: docs are re-indexed, policies/versions change, tenant data changes.
Step 4: Optional RAG retrieval
If the task type requires document grounding:
- Normalize/expand query (cheap LLM or rule-based rewrite)
- Query vector DB / hybrid search: Return top-N candidates (e.g., 20)
- Rerank (optional): Use a lightweight reranker
- Select final context: Drop marginal hits, merge or trim chunks to fit token budget
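The "select final context" step is mostly a token-budget loop. In the sketch below, `count_tokens` stands in for your tokenizer (e.g., tiktoken), and the 1500-token budget and 0.3 score cutoff are illustrative.

```python
# A sketch of "select final context". `count_tokens` stands in for your
# tokenizer; the 1500-token budget and 0.3 score cutoff are illustrative.
def select_context(ranked_chunks, count_tokens, token_budget=1500, min_score=0.3):
    selected, used = [], 0
    for chunk in ranked_chunks:                  # assumed sorted by rerank score, best first
        if chunk["score"] < min_score:
            break                                # drop marginally relevant hits
        cost = count_tokens(chunk["text"])
        if used + cost > token_budget:
            continue                             # skip chunks that would blow the budget
        selected.append(chunk["text"])
        used += cost
    return selected
```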
Step 5: Construct prompt with token discipline
Before calling any model:
- Use minimal system prompt tuned for the task
- Inject only the top K context chunks (K chosen per SKU/task)
- Explicitly constrain: output length, format (JSON, bullets), tone
Example structure for QA with context:
System: You are a concise assistant for [product]. Use ONLY the context below. If you don't know, say you don't know.
Context: [chunk 1] [chunk 2] ...
User: [user question]
Assistant rules: Answer in at most 5 bullet points. Do not invent facts not supported by the context.
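As a sketch, that structure maps to a small builder function; the product name, bullet limit, and chat-message format below are placeholders you'd adapt per task.

```python
# A sketch of a prompt builder for the QA-with-context structure above.
# Product name and constraints are placeholders.
def build_qa_messages(question: str, context_chunks: list[str], product: str = "your product"):
    system = (
        f"You are a concise assistant for {product}. "
        "Use ONLY the context below. If you don't know, say you don't know. "
        "Answer in at most 5 bullet points. "
        "Do not invent facts not supported by the context."
    )
    context = "\n\n".join(f"[chunk {i + 1}]\n{c}" for i, c in enumerate(context_chunks))
    return [
        {"role": "system", "content": f"{system}\n\nContext:\n{context}"},
        {"role": "user", "content": question},
    ]
```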
Step 6: Call the routed model backend
Choose the model based on routing decision:
- If easy + low-risk → cheap model backend (GPT-4o-mini, quantized small open model)
- Else → expensive model backend (larger model)
If self-hosting: cheap backend is smaller and heavily quantized (int4/int8, AWQ/GPTQ); expensive backend is larger, maybe mixed precision with some quantization.
Step 7: Post-processing, logging, and optional caching
After getting the model output:
- Validate format – JSON? Use schema validator. If invalid, optionally do a cheap "repair" pass.
- Apply guardrails – Domain-specific filters, safety checks / redaction
- Log everything – Raw user query, routing choice + model used, tokens in/out, latency breakdown, cache hit status, context doc IDs / versions
- Cache the result (if cacheable) – For FAQs and stable answers, store embeddings + answer + doc version
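For the format-validation step, here's a sketch of a validate-then-repair pass; it assumes the `jsonschema` package, and `cheap_llm` and `ANSWER_SCHEMA` are illustrative placeholders for your small-model call and your output contract.

```python
# A sketch of the validate-then-repair step. Assumes the `jsonschema` package;
# `cheap_llm` and ANSWER_SCHEMA are illustrative placeholders.
import json
from jsonschema import validate, ValidationError

ANSWER_SCHEMA = {
    "type": "object",
    "properties": {"answer": {"type": "string"}, "sources": {"type": "array"}},
    "required": ["answer"],
}

def parse_or_repair(raw_output: str, cheap_llm) -> dict:
    for attempt in range(2):
        try:
            data = json.loads(raw_output)
            validate(instance=data, schema=ANSWER_SCHEMA)
            return data
        except (json.JSONDecodeError, ValidationError):
            if attempt == 1:
                raise
            # One cheap "repair" pass: ask the small model to fix the JSON only.
            raw_output = cheap_llm(
                "Fix the following so it is valid JSON with key 'answer' (string) "
                "and optional 'sources' (list). Return only the JSON:\n" + raw_output
            )
```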
11.3 Architecture Diagram (Text Version)
        [ Client / Frontend ]
                 |
                 v
       [ API Gateway / Edge ]
                 |
                 v
          [ Brain Service ]
          /      |        \
         v       v         v
   [Router] [Semantic   [RAG Retrieval]
              Cache]          |
        \        |            v
         \       |     [Context Builder]
          \      |            |
           v     v            v
       [ Model Backend Selector ]
                 |
        +--------+---------+
        |                  |
        v                  v
[Cheap Model Backend] [Expensive Model Backend]
        |                  |
        +--------+---------+
                 |
                 v
         [Post-Processor]
                 |
                 v
        [Logging + Metrics]
                 |
                 v
            [Response]
11.4 Concrete Routing Rules (Starting Point)
Start with simple but effective rules:
- If task_type in {classification, tag_prediction, sentiment} → use cheap_model
- If len(user_query) < 128 tokens and no RAG required → use cheap_model
- If RAG required and sum(context_tokens) < 512 and task_type = faq_qna → try cheap_model, fall back to expensive_model if confidence is low
- If domain in {compliance, pricing, legalish} or user is enterprise_tier → default to expensive_model
Refine later with: data-driven router (train classifier on past success/fail), LLM-as-router ("Is this easy or hard?" with a small model).
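The starting-point rules above translate almost one-to-one into code. In the sketch below, `count_tokens` stands in for your tokenizer and the thresholds come straight from the list.

```python
# A direct translation of the starting-point rules above. `count_tokens`
# stands in for your tokenizer; thresholds come from the list.
CHEAP_TASKS = {"classification", "tag_prediction", "sentiment"}
EXPENSIVE_DOMAINS = {"compliance", "pricing", "legalish"}

def pick_tier(task_type: str, user_query: str, domain: str, user_tier: str,
              rag_required: bool, context_tokens: int, count_tokens) -> str:
    if domain in EXPENSIVE_DOMAINS or user_tier == "enterprise_tier":
        return "expensive"
    if task_type in CHEAP_TASKS:
        return "cheap"
    if not rag_required and count_tokens(user_query) < 128:
        return "cheap"
    if rag_required and context_tokens < 512 and task_type == "faq_qna":
        return "cheap"          # fall back to "expensive" downstream if confidence is low
    return "expensive"
```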
11.5 Where Quantization Fits In
If self-hosting models:
- Cheap backend: Aggressively quantized (int4/8, AWQ/GPTQ), tuned for throughput
- Expensive backend: Larger model, maybe partially quantized, tuned more for quality than raw speed
Expose them behind a unified interface:
POST /llm-inference
{
"model_tier": "cheap" | "expensive",
"prompt": "...",
"max_tokens": 256,
"temperature": 0.2
}
The Brain doesn't care if it's GPTQ/AWQ/whatever under the hood — it just picks the tier.
11.6 Metrics You Absolutely Need
For each request, log at least:
- request_id, user_id / tenant_id (or hashed)
- task_type, model_tier (cheap / expensive), model_name
- tokens_in, tokens_out
- latency_total_ms, latency_model_ms, latency_retrieval_ms
- cache_hit (true/false), rag_used (true/false)
- num_llm_calls
Then build dashboards: cost per 1K requests by endpoint + model tier, P50/P90/P99 latency by endpoint + model tier, cache hit rate over time, % routed to cheap vs expensive models, quality proxy (LLM-as-judge on sample traffic).
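A sketch of what that per-request record can look like as a plain dict, assuming a hypothetical request-context object `ctx` that already holds the measured values.

```python
# A sketch of the per-request log record described above. `ctx` is a
# hypothetical request-context object holding the measured values.
import time
import uuid

def build_request_log(ctx) -> dict:
    return {
        "request_id": str(uuid.uuid4()),
        "tenant_id": ctx.tenant_id_hash,         # hashed, never raw PII
        "task_type": ctx.task_type,
        "model_tier": ctx.model_tier,            # "cheap" or "expensive"
        "model_name": ctx.model_name,
        "tokens_in": ctx.tokens_in,
        "tokens_out": ctx.tokens_out,
        "latency_total_ms": ctx.latency_total_ms,
        "latency_model_ms": ctx.latency_model_ms,
        "latency_retrieval_ms": ctx.latency_retrieval_ms,
        "cache_hit": ctx.cache_hit,
        "rag_used": ctx.rag_used,
        "num_llm_calls": ctx.num_llm_calls,
        "timestamp": time.time(),
    }
```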
That's how you turn all this into real unit economics data, not vibes.
11.7 How This Actually Reduces Latency & Cost
You get multiplicative gains because:
- Routing keeps most traffic on the cheap/fast path
- Semantic caching makes repeated queries nearly free
- RAG token discipline keeps prompts small
- Quantization boosts throughput and lowers infra cost
- Reduced call count (simpler orchestration) cuts both tokens and latency
Nothing here is exotic. It's just a coherent design instead of a pile of ad-hoc hacks.