How to reduce LLM inference latency and token costs?
The pain everyone eventually hits but nobody budgets for: LLM unit economics. Learn how to reduce costs and latency without gutting quality.
You launch a prototype, everyone loves it, usage climbs… and suddenly latency blows up, token bills look like a Series A round, and product wants "just one more feature" that adds 3 more model calls per request. This post is about how to reduce LLM inference latency and token costs without gutting quality.
1. The Unit Economics Problem (You Can't Ignore This)
LLM apps have brutal cost dynamics:
- You pay per token, both in and out
- Latency grows with tokens, model size, and call count
- Most teams design UX and pipelines like tokens are free
It's fine when you're at 1,000 requests/day. At 1,000,000+ requests/day, every extra 500 input tokens and every unnecessary model call is real money and real user pain.
So you need to think like this:
For this user interaction, how many model calls, at what size, with how many tokens, at what latency — and what revenue or value is attached?
That's unit economics.
2. Where Your Latency and Costs Actually Come From
Break it down per request:
- Number of model calls – Single call vs multi-step agents vs chain-of-thought prompts
- Model choice – Big, slow, expensive vs small, fast, cheap
- Token volume – Prompt bloat (huge system prompts, examples, context), over-long responses
- Infrastructure overhead – Cold starts, network hops, crappy batching, underutilized GPUs/TPUs
- Extra stuff – Reranking calls, LLM-as-a-judge calls, secondary tools (classifiers, extractors)
Your job is to attack each dimension without killing quality.
3. First Lever: Stop Wasting Tokens
Before you touch quantization or fancy routing, do the obvious thing: send less junk.
3.1 Clean up prompts
- Strip boilerplate you mindlessly copy-pasted from prompt-engineering Reddit threads
- Shorten system instructions to what actually matters
- Turn multi-paragraph tone guides into a single clear sentence
Bad: "You are a super-intelligent AI system that always does X, Y, Z, writes like Hemingway, cares deeply about empathy, blah blah..."
Better: "Answer concisely, in 3–5 bullet points, using a direct and professional tone."
3.2 Control context expansion
RAG is amazing — and also a token bomb if you're lazy.
- Limit the number of retrieved chunks
- Use smaller chunk sizes with smart overlap
- Rerank and drop marginally relevant chunks instead of dumping 20 docs into context
- Use query classification: If the question is simple or generic, skip retrieval entirely. If it clearly needs docs, then retrieve.
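To make the query-classification idea in the last bullet concrete, here is a minimal sketch. The heuristics are deliberately crude, and the `retriever` and `llm` objects are hypothetical stand-ins for your own clients; in practice you'd likely swap the rules for a small classifier.

```python
# A rough sketch of a retrieval gate. The heuristics, GENERIC_PATTERNS,
# and the `retriever` / `llm` objects are illustrative placeholders.
GENERIC_PATTERNS = ("hello", "thanks", "what can you do", "who are you")

def needs_retrieval(query: str) -> bool:
    """Decide whether this query needs document grounding at all."""
    q = query.strip().lower()
    if len(q.split()) < 4:                      # short chit-chat: skip retrieval
        return False
    if any(p in q for p in GENERIC_PATTERNS):   # generic queries: skip retrieval
        return False
    return True                                 # everything else goes through RAG

def answer(query: str, retriever, llm) -> str:
    context = retriever.search(query, top_k=5) if needs_retrieval(query) else []
    return llm.generate(query=query, context=context)
```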
3.3 Control output length
You don't need 10 paragraphs when 5 bullet points will do.
- Be explicit: "Answer in ≤150 words" or "No more than 5 bullet points."
- For APIs: "Return only valid JSON, no explanation text."
Fewer output tokens → lower cost and (usually) lower latency.
4. Model Routing: Use the Cheap Stuff First
This is the biggest structural win: not every request needs your largest, fanciest model.
4.1 Tiered model strategy
Set up at least two tiers:
- Tier 1: small / cheap model (e.g., GPT-4o-mini, distilled model) – Use for: simple classification, short queries, low-risk tasks, things where occasional minor errors are acceptable
- Tier 2: large / expensive model – Use for: complex reasoning, ambiguous high-value user queries, anything user-facing where quality is critical
4.2 How to route
You can:
- Use rules-based routing: If prompt length < X and task = "simple classification" → small model. If context length > Y or multi-step reasoning required → big model.
- Or LLM-as-router: Cheap model (or special router head) looks at the request and decides if it's "easy" or "hard". Only send "hard" to the expensive model.
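Here's a minimal sketch of the LLM-as-router variant, assuming the OpenAI Python SDK (v1.x); the model names and the routing prompt are placeholders, not recommendations.

```python
# A minimal LLM-as-router sketch, assuming the OpenAI Python SDK (v1.x).
# Model names and the routing prompt are placeholders.
from openai import OpenAI

client = OpenAI()

ROUTER_PROMPT = (
    "Classify the difficulty of the user request as EASY or HARD.\n"
    "EASY: short factual lookups, simple classification, formatting tasks.\n"
    "HARD: multi-step reasoning, ambiguous or high-stakes questions.\n"
    "Reply with exactly one word: EASY or HARD."
)

def pick_model(user_query: str) -> str:
    decision = client.chat.completions.create(
        model="gpt-4o-mini",            # cheap model acting as the router
        messages=[
            {"role": "system", "content": ROUTER_PROMPT},
            {"role": "user", "content": user_query},
        ],
        max_tokens=3,
        temperature=0,
    )
    label = (decision.choices[0].message.content or "").strip().upper()
    # Only clearly HARD requests hit the expensive model.
    return "gpt-4o" if label == "HARD" else "gpt-4o-mini"
```

Note that the router itself costs a call; it only pays off when the cheap-tier savings outweigh that overhead, which is why rules-based routing is usually the first step.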
4.3 Optimization loop
Track: % of traffic going to each tier, quality metrics by tier (user feedback, eval scores), cost/latency per tier.
Goal: max traffic on the cheap tier without breaking quality thresholds.
5. Semantic Caching: Don't Pay Twice for the Same Work
Semantic caching is underused and criminally effective.
5.1 What it is
Instead of just caching exact prompts, you:
- Compute an embedding for the user query
- Look up similar past queries in a vector cache
- If you find one above a similarity threshold: reuse the previous answer (or lightly adapt it), skip the full model call
This helps for: repeated FAQs, very similar support questions, recurrent internal queries ("What's our PTO policy?" ×100).
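A minimal in-memory sketch of that lookup is below, assuming an `embed()` callable that returns a NumPy vector; the 0.92 threshold and the flat list store are illustrative, and in production you'd back this with a vector DB.

```python
# A minimal in-memory semantic cache. `embed` is any callable returning a
# NumPy vector; the 0.92 threshold and the flat list store are illustrative.
import numpy as np

class SemanticCache:
    def __init__(self, embed, threshold: float = 0.92):
        self.embed = embed
        self.threshold = threshold
        self.entries = []                        # (unit_embedding, answer, metadata)

    def _unit(self, text: str) -> np.ndarray:
        v = np.asarray(self.embed(text), dtype=np.float32)
        return v / np.linalg.norm(v)

    def lookup(self, query: str):
        q = self._unit(query)
        for emb, answer, meta in self.entries:
            if float(np.dot(q, emb)) >= self.threshold:
                return answer, meta              # cache hit: skip the model call
        return None                              # miss: caller does the full call

    def store(self, query: str, answer: str, meta: dict):
        self.entries.append((self._unit(query), answer, meta))
```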
5.2 How to do it safely
- Set a similarity threshold that's strict enough not to mis-answer edge cases
- Store: prompt, context used, final answer, any metadata (time, doc version, etc.)
- Invalidate cache or lower trust when: docs are updated, policy versions change, the answer depends on rapidly changing data
5.3 ROI
Semantic caching reduces token costs (no or fewer new calls) and latency (cache hit is near-instant). It's especially powerful at scale: as usage grows, cache hit rates improve.
6. Quantization: Squeezing More Out of Your Hardware
If you're hosting models yourself (or using open models), quantization is a huge lever for latency and cost.
6.1 What quantization does
- Converts model weights from higher precision (e.g., fp16) to lower precision (e.g., int8, int4)
- This: shrinks model size, improves memory bandwidth, often improves throughput and reduces latency
You keep most of the performance while getting more inferences per GPU.
6.2 AWQ, GPTQ, etc. (high level)
You don't need a PhD here; just know:
- GPTQ: Post-training quantization method, often used for 4-bit quant of LLMs
- AWQ: Activation-aware weight quantization, tends to preserve quality better for some models
- There are also: QLoRA-style training with quantized base, other int4/int8 schemes
Your choice depends on: model architecture, hardware target, tolerance for small quality drop vs speed gain.
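For self-hosted open models, one common way to get 4-bit weights is the transformers + bitsandbytes stack; the sketch below assumes those packages, a CUDA GPU, and a placeholder model ID.

```python
# A sketch of loading an open model in 4-bit via transformers + bitsandbytes.
# Assumes a CUDA GPU; the model ID is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",                  # 4-bit NormalFloat weights
    bnb_4bit_compute_dtype=torch.bfloat16,      # compute in bf16 for quality
)

model_id = "meta-llama/Llama-3.1-8B-Instruct"   # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
# Rough intuition: 8B params at fp16 is ~16 GB of weights; at 4-bit it's ~4-5 GB,
# which is the headroom that lets you serve more concurrent requests per GPU.
```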
6.3 Why this matters for unit economics
Quantization gives you: more concurrent requests per GPU, lower per-request latency, lower infra cost (fewer or smaller machines).
If you're at scale and not using quantization for self-hosted models, you're burning money for fun.
7. Reduce Round-Trips: Flatten the Orchestration
A lot of "agentic" systems die on unit economics because they do this:
- Call LLM to decide what to do
- Call tool
- Call LLM to interpret tool
- Call another tool
- Call LLM again for final answer
That's 3–5+ LLM calls per user query.
7.1 Strategies to reduce call count
- Combine steps: Use a single call to both decide and answer, when safe
- Pre-plan: For certain flows (e.g., known form filling), design a fixed sequence instead of open-ended agents
- Use cheaper models for planning, expensive model only for final user-facing text
- Use non-LLM logic where you can: simple conditionals, heuristics, classic classifiers
7.2 Measure and cap
For each endpoint, define:
- Max allowed number of LLM calls per request
- Target and absolute max latency
- Target and absolute max token budget
If an agent wants to go beyond that, fail gracefully or return partial results instead of spinning out.
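One way to enforce those caps is a small budget object threaded through the agent loop. A sketch is below; the limits and the `call_llm` helper (assumed to return `.text` and `.tokens_used`) are illustrative.

```python
# A sketch of per-request budgets. The limits and the `call_llm` helper
# (which returns .text and .tokens_used) are illustrative assumptions.
import time

class BudgetExceeded(Exception):
    pass

class RequestBudget:
    def __init__(self, max_llm_calls=3, max_tokens=4000, max_seconds=8.0):
        self.max_llm_calls = max_llm_calls
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self.calls, self.tokens = 0, 0
        self.start = time.monotonic()

    def charge(self, tokens_used: int):
        self.calls += 1
        self.tokens += tokens_used
        over = (self.calls > self.max_llm_calls
                or self.tokens > self.max_tokens
                or time.monotonic() - self.start > self.max_seconds)
        if over:
            raise BudgetExceeded("request exceeded its LLM budget")

def run_agent(query: str, call_llm, budget: RequestBudget) -> str:
    try:
        plan = call_llm(f"Plan the steps needed to answer: {query}")
        budget.charge(plan.tokens_used)
        final = call_llm(f"Answer using this plan:\n{plan.text}\n\nQuestion: {query}")
        budget.charge(final.tokens_used)
        return final.text
    except BudgetExceeded:
        # Fail gracefully instead of spinning out.
        return "I couldn't finish this request within its budget. Here's a partial answer."
```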
8. Infra Tuning: Batching, Streaming, and Deployment Details
Once you've done routing, caching, quantization, and token dieting, infra tuning is the last big lever.
8.1 Batching
If you control the serving stack:
- Batch multiple requests per forward pass where latency budget allows
- Great for: background jobs, LLM-as-a-judge evaluations, non-interactive workloads
8.2 Streaming responses
Streaming doesn't reduce total generation time, but it improves perceived latency:
- User sees the first tokens in 200–500 ms, even if full generation takes 2–3 seconds
- Also lets you: cut off long generations early if the user abandons the request, enforce max tokens dynamically
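A minimal streaming sketch, assuming the OpenAI Python SDK (v1.x); `user_disconnected` stands in for whatever your web framework provides to detect an abandoned request, and the model name is a placeholder.

```python
# A minimal streaming sketch, assuming the OpenAI Python SDK (v1.x).
# `user_disconnected` is a placeholder for your framework's disconnect check.
from openai import OpenAI

client = OpenAI()

def stream_answer(prompt: str, user_disconnected) -> str:
    stream = client.chat.completions.create(
        model="gpt-4o-mini",                    # placeholder model
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
        stream=True,
    )
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        parts.append(delta)
        print(delta, end="", flush=True)        # first tokens reach the user immediately
        if user_disconnected():
            break                               # stop paying for an abandoned request
    return "".join(parts)
```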
8.3 Deployment details that matter
- Keep models hot (avoid cold starts)
- Put inference endpoints close to your users or latency-sensitive services
- Monitor: GPU utilization, queue times, per-request breakdown (network vs compute)
This is standard ML infra hygiene, but it matters more when every 100ms and every cent per request scales with usage.
9. Putting It All Together: A Practical Playbook
Here's a pragmatic order of operations to fix your unit economics.
- Measure the baseline – For each endpoint: avg/max latency, avg tokens in/out, avg calls per request, cost per 1K requests
- Cut obvious waste – Shorten prompts, trim context, constrain answer length
- Add model routing – Define "easy" vs "hard" queries, route easy traffic to a cheaper/smaller model
- Add semantic caching – Cache common queries with their context + answers, track hit rate and savings
- Quantize (if self-hosting) – Move from fp16 → int8/int4 where quality allows, re-measure latency and throughput
- Reduce orchestration hops – Merge LLM calls where reasonable, replace LLM logic with classic code where you can
- Tune infra – Batching for non-interactive workloads, streaming for interactive, fix obvious deployment inefficiencies
Re-run the numbers and calculate savings per 1K/100K/1M requests. That's your real unit economics win.
10. The mindset: treat tokens like money and latency like churn
If you're serious about AI/ML ops:
- Tokens are not an abstraction — they're direct cost.
- Latency is not just "performance" — it's user experience and conversion.
- Model choice, routing, caching, and quantization are financial levers, not just fun engineering toys.
You don't have to do everything at once, but you can't pretend this doesn't matter once your app sees real traffic.
11. Practical Architecture: How to Actually Build This
Let's turn the theory into something concrete you can implement.
11.1 High-Level Architecture Overview
Think of your LLM stack as three layers:
- Edge / API layer – Receives user requests, handles auth/rate limiting/validation, talks to the "Brain" service
- Brain (Orchestration) layer – Request classifier & router, semantic cache, RAG retrieval (optional), calls model backends, applies post-processing
- Model & Data layer – Cheap model backend (GPT-4o-mini / quantized small model), expensive model backend (larger model), vector DB / search index (for RAG), metrics + logging store
11.2 Request Flow: Step-by-Step
Step 1: API receives request
Input: user_id, text, task_type, and optional metadata (tenant, language, flags). Quick checks: auth, basic input length, traffic sampling flags. Then forward to Brain service.
Step 2: Lightweight classification & routing decision
First thing in Brain:
- Task classification (cheap model or rules): task type (qa, summarize, classify, code, etc.), complexity score (simple vs complex), risk level (low vs high)
- Routing decision: easy + low-risk → cheap model path; complex or high-risk → expensive model path
This can be a simple rules engine, or a tiny router model (logistic regression, small LLM, or fine-tuned classifier).
Step 3: Semantic cache lookup
Before you spend tokens, check cache:
- Compute an embedding for the entire user query (for free-form Q&A), and/or a normalized key (e.g., "faq:refund_policy")
- Hit vector cache: If semantic similarity > threshold → cache hit (return cached answer); if no hit → continue
Cache entry stores: query_embedding, normalized_query, user_query_example, answer, source_docs, doc_version, created_at, metadata (language, tenant_id, model_used).
Invalidate cache when: docs are re-indexed, policies/versions change, tenant data changes.
Step 4: Optional RAG retrieval
If the task type requires document grounding:
- Normalize/expand query (cheap LLM or rule-based rewrite)
- Query vector DB / hybrid search: Return top-N candidates (e.g., 20)
- Rerank (optional): Use a lightweight reranker
- Select final context: Drop marginal hits, merge or trim chunks to fit token budget
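The "select final context" step is mostly a token-budget loop. In the sketch below, `count_tokens` stands in for your tokenizer (e.g., tiktoken), and the 1500-token budget and 0.3 score cutoff are illustrative.

```python
# A sketch of "select final context". `count_tokens` stands in for your
# tokenizer; the 1500-token budget and 0.3 score cutoff are illustrative.
def select_context(ranked_chunks, count_tokens, token_budget=1500, min_score=0.3):
    selected, used = [], 0
    for chunk in ranked_chunks:                  # assumed sorted by rerank score, best first
        if chunk["score"] < min_score:
            break                                # drop marginally relevant hits
        cost = count_tokens(chunk["text"])
        if used + cost > token_budget:
            continue                             # skip chunks that would blow the budget
        selected.append(chunk["text"])
        used += cost
    return selected
```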
Step 5: Construct prompt with token discipline
Before calling any model:
- Use minimal system prompt tuned for the task
- Inject only the top K context chunks (K chosen per SKU/task)
- Explicitly constrain: output length, format (JSON, bullets), tone
Example structure for QA with context:
System: You are a concise assistant for [product]. Use ONLY the context below. If you don't know, say you don't know.
Context: [chunk 1] [chunk 2] ...
User: [user question]
Assistant rules: Answer in at most 5 bullet points. Do not invent facts not supported by the context.
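As a sketch, that structure maps to a small builder function; the product name, bullet limit, and chat-message format below are placeholders you'd adapt per task.

```python
# A sketch of a prompt builder for the QA-with-context structure above.
# Product name and constraints are placeholders.
def build_qa_messages(question: str, context_chunks: list[str], product: str = "your product"):
    system = (
        f"You are a concise assistant for {product}. "
        "Use ONLY the context below. If you don't know, say you don't know. "
        "Answer in at most 5 bullet points. "
        "Do not invent facts not supported by the context."
    )
    context = "\n\n".join(f"[chunk {i + 1}]\n{c}" for i, c in enumerate(context_chunks))
    return [
        {"role": "system", "content": f"{system}\n\nContext:\n{context}"},
        {"role": "user", "content": question},
    ]
```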
Step 6: Call the routed model backend
Choose the model based on routing decision:
- If easy + low-risk → cheap model backend (GPT-4o-mini, quantized small open model)
- Else → expensive model backend (larger model)
If self-hosting: cheap backend is smaller and heavily quantized (int4/int8, AWQ/GPTQ); expensive backend is larger, maybe mixed precision with some quantization.
Step 7: Post-processing, logging, and optional caching
After getting the model output:
- Validate format – JSON? Use schema validator. If invalid, optionally do a cheap "repair" pass.
- Apply guardrails – Domain-specific filters, safety checks / redaction
- Log everything – Raw user query, routing choice + model used, tokens in/out, latency breakdown, cache hit status, context doc IDs / versions
- Cache the result (if cacheable) – For FAQs and stable answers, store embeddings + answer + doc version
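For the format-validation step, here's a sketch of a validate-then-repair pass; it assumes the `jsonschema` package, and `cheap_llm` and `ANSWER_SCHEMA` are illustrative placeholders for your small-model call and your output contract.

```python
# A sketch of the validate-then-repair step. Assumes the `jsonschema` package;
# `cheap_llm` and ANSWER_SCHEMA are illustrative placeholders.
import json
from jsonschema import validate, ValidationError

ANSWER_SCHEMA = {
    "type": "object",
    "properties": {"answer": {"type": "string"}, "sources": {"type": "array"}},
    "required": ["answer"],
}

def parse_or_repair(raw_output: str, cheap_llm) -> dict:
    for attempt in range(2):
        try:
            data = json.loads(raw_output)
            validate(instance=data, schema=ANSWER_SCHEMA)
            return data
        except (json.JSONDecodeError, ValidationError):
            if attempt == 1:
                raise
            # One cheap "repair" pass: ask the small model to fix the JSON only.
            raw_output = cheap_llm(
                "Fix the following so it is valid JSON with key 'answer' (string) "
                "and optional 'sources' (list). Return only the JSON:\n" + raw_output
            )
```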
11.3 Architecture Diagram (Text Version)
        [ Client / Frontend ]
                 |
                 v
       [ API Gateway / Edge ]
                 |
                 v
          [ Brain Service ]
          /      |        \
         v       v         v
   [Router] [Semantic   [RAG Retrieval]
              Cache]          |
        \        |            v
         \       |     [Context Builder]
          \      |            |
           v     v            v
       [ Model Backend Selector ]
                 |
        +--------+---------+
        |                  |
        v                  v
[Cheap Model Backend] [Expensive Model Backend]
        |                  |
        +--------+---------+
                 |
                 v
         [Post-Processor]
                 |
                 v
        [Logging + Metrics]
                 |
                 v
            [Response]
11.4 Concrete Routing Rules (Starting Point)
Start with simple but effective rules:
- If task_type in {classification, tag_prediction, sentiment} → use cheap_model
- If len(user_query) < 128 tokens and no RAG required → use cheap_model
- If RAG required and sum(context_tokens) < 512 and task_type = faq_qna → try cheap_model, fall back to expensive_model if confidence is low
- If domain in {compliance, pricing, legalish} or user is enterprise_tier → default to expensive_model
Refine later with: data-driven router (train classifier on past success/fail), LLM-as-router ("Is this easy or hard?" with a small model).
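The starting-point rules above translate almost one-to-one into code. In the sketch below, `count_tokens` stands in for your tokenizer and the thresholds come straight from the list.

```python
# A direct translation of the starting-point rules above. `count_tokens`
# stands in for your tokenizer; thresholds come from the list.
CHEAP_TASKS = {"classification", "tag_prediction", "sentiment"}
EXPENSIVE_DOMAINS = {"compliance", "pricing", "legalish"}

def pick_tier(task_type: str, user_query: str, domain: str, user_tier: str,
              rag_required: bool, context_tokens: int, count_tokens) -> str:
    if domain in EXPENSIVE_DOMAINS or user_tier == "enterprise_tier":
        return "expensive"
    if task_type in CHEAP_TASKS:
        return "cheap"
    if not rag_required and count_tokens(user_query) < 128:
        return "cheap"
    if rag_required and context_tokens < 512 and task_type == "faq_qna":
        return "cheap"          # fall back to "expensive" downstream if confidence is low
    return "expensive"
```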
11.5 Where Quantization Fits In
If self-hosting models:
- Cheap backend: Aggressively quantized (int4/8, AWQ/GPTQ), tuned for throughput
- Expensive backend: Larger model, maybe partially quantized, tuned more for quality than raw speed
Expose them behind a unified interface:
POST /llm-inference
{
"model_tier": "cheap" | "expensive",
"prompt": "...",
"max_tokens": 256,
"temperature": 0.2
}
The Brain doesn't care if it's GPTQ/AWQ/whatever under the hood — it just picks the tier.
11.6 Metrics You Absolutely Need
For each request, log at least:
- request_id, user_id / tenant_id (or hashed)
- task_type, model_tier (cheap / expensive), model_name
- tokens_in, tokens_out
- latency_total_ms, latency_model_ms, latency_retrieval_ms
- cache_hit (true/false), rag_used (true/false)
- num_llm_calls
Then build dashboards: cost per 1K requests by endpoint + model tier, P50/P90/P99 latency by endpoint + model tier, cache hit rate over time, % routed to cheap vs expensive models, quality proxy (LLM-as-judge on sample traffic).
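A sketch of what that per-request record can look like as a plain dict, assuming a hypothetical request-context object `ctx` that already holds the measured values.

```python
# A sketch of the per-request log record described above. `ctx` is a
# hypothetical request-context object holding the measured values.
import time
import uuid

def build_request_log(ctx) -> dict:
    return {
        "request_id": str(uuid.uuid4()),
        "tenant_id": ctx.tenant_id_hash,         # hashed, never raw PII
        "task_type": ctx.task_type,
        "model_tier": ctx.model_tier,            # "cheap" or "expensive"
        "model_name": ctx.model_name,
        "tokens_in": ctx.tokens_in,
        "tokens_out": ctx.tokens_out,
        "latency_total_ms": ctx.latency_total_ms,
        "latency_model_ms": ctx.latency_model_ms,
        "latency_retrieval_ms": ctx.latency_retrieval_ms,
        "cache_hit": ctx.cache_hit,
        "rag_used": ctx.rag_used,
        "num_llm_calls": ctx.num_llm_calls,
        "timestamp": time.time(),
    }
```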
That's how you turn all this into real unit economics data, not vibes.
11.7 How This Actually Reduces Latency & Cost
You get multiplicative gains because:
- Routing keeps most traffic on the cheap/fast path
- Semantic caching makes repeated queries nearly free
- RAG token discipline keeps prompts small
- Quantization boosts throughput and lowers infra cost
- Reduced call count (simpler orchestration) cuts both tokens and latency
Nothing here is exotic. It's just a coherent design instead of a pile of ad-hoc hacks.