Most LLM Inference Bills Are 5x Too High. Here's What to Do About It.

TL;DR

  • Inference now accounts for two-thirds of all AI compute costs — and 80–90% of a system's lifetime spend
  • Most teams run LLMs on H100s when L40S or quantized A100s work just as well at 2–5x lower cost
  • The ICML 2025 paper on heterogeneous GPU serving shows a 25–41% throughput gain at the same budget
  • Continuous batching alone improves throughput by 8–23x over naive serving
  • Model quantization (INT8/INT4) cuts memory use by 46–75%, opening up much cheaper GPU classes
  • Semantic caching and model routing can eliminate 50–86% of costs for repetitive workloads

Your LLM inference bill should be 60–80% lower than it is. 

That's not a guess. It's what the numbers show. One enterprise went from $100,000 to $45,000 a month. Checkr cut costs by 5x. Care Access reduced their bill by 86%.

They all used the same playbook. This guide walks you through it.

What Is Heterogeneous GPU Serving?

Most teams run a single GPU type across their entire inference cluster. They pick H100s, rent a fleet, and run everything through them. 

That's expensive. And it ignores how LLM workloads actually behave.

Different requests have very different demands. A summarization task might send 2,000 tokens in and get 20 back. A chatbot might send 300 tokens in and generate 500. These two workloads are almost opposites in what they need from hardware.

Heterogeneous GPU serving means using multiple GPU types in the same cluster — matching each request type to the hardware it actually needs.

 

The landmark paper on this is "Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs" (Jiang et al., ICML 2025). It studied three levers:

  1. GPU composition — which GPU types to rent, and how many of each
  2. Deployment configuration — how to partition the model across GPUs
  3. Workload assignment — which requests go to which GPU pool

Disabling any single lever causes throughput to drop by 27–34%. All three working together beat Helix — the previous state of the art — by 25–35% on the same $15/hour budget.

The best part: you don't need to buy new hardware. It's a scheduling and configuration problem, not a procurement one.
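To make the workload-assignment lever concrete, here is a minimal routing sketch. The pool names, prices, and the prefill/decode ratio thresholds are illustrative assumptions, not values from the paper — a real system would tune them against measured latency and cost.

```python
from dataclasses import dataclass

# Hypothetical GPU pools -- prices echo typical 2026 on-demand rates.
POOLS = {
    "h100": 2.74,  # $/hr, high-FLOPS: good for compute-bound prefill
    "a100": 1.42,  # $/hr, good for memory-bandwidth-bound decode
    "l40s": 1.80,  # $/hr, balanced mid-tier
}

@dataclass
class Request:
    prompt_tokens: int
    expected_output_tokens: int

def route(req: Request) -> str:
    """Assign a request to a GPU pool by its prefill/decode ratio.

    Long-input/short-output requests are compute-bound (prefill-heavy),
    so they benefit from high-FLOPS GPUs; short-input/long-output
    requests are decode-heavy and run fine on cheaper hardware.
    Thresholds here are illustrative.
    """
    ratio = req.prompt_tokens / max(req.expected_output_tokens, 1)
    if ratio >= 10:   # e.g. summarization: 2,000 in / 20 out
        return "h100"
    if ratio <= 1:    # e.g. chatbot: 300 in / 500 out
        return "a100"
    return "l40s"     # mixed workloads
```

The same shape extends to the other two levers: the pool dict becomes the GPU-composition decision, and per-pool model partitioning becomes the deployment configuration.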

 

The GPU Landscape: What Things Actually Cost

Here's what you're working with in 2026:

| GPU | VRAM | On-Demand Cost | Best For |
|---|---|---|---|
| NVIDIA T4 | 16GB | $0.35–0.50/hr | 7B models, budget workloads |
| NVIDIA L4 | 24GB | $0.50–0.80/hr | 7B–14B models, efficient inference |
| NVIDIA L40S | 48GB | ~$1.80/hr | 30B–40B models, mid-tier production |
| NVIDIA A100 40GB | 40GB | ~$1.42/hr | Large model inference |
| NVIDIA A100 80GB | 80GB | ~$1.76/hr | 70B models in FP16 |
| NVIDIA H100 SXM | 80GB | $2.74–3.90/hr | Frontier models, ultra-low latency |
| NVIDIA H200 | 141GB | ~$5.00/hr | Very long context, 70B+ single GPU |

Notice the gap. An H100 costs 7x more than a T4. For a quantized 7B model, an L4 delivers latency within 20–30% of H100 — at one-fifth the price.

H100 prices have also dropped fast. They fell 64–75% from their peak. Hyperbolic now offers H100 at $1.49/hr. But even at that rate, cheaper GPUs win for most workloads.

The rule: Only use H100 when you need sub-100ms latency or are running unquantized 70B+ models. Everything else can run on A100, L40S, or lower.

 

5 Techniques That Actually Reduce Costs

1. Model Quantization

Quantization reduces the numerical precision of model weights. Instead of 16-bit float (FP16), you use 8-bit integer (INT8) or 4-bit integer (INT4).

The result: smaller models that fit on cheaper hardware.

| Precision | Memory Reduction | Cost Impact |
|---|---|---|
| FP16 → INT8 | ~50% | 2–4x cheaper hardware |
| FP16 → INT4 | ~75% | 4–8x cheaper hardware |
| FP16 → FP8 | ~50% | 1.5x throughput gain (H100+) |

LLaMA-70B in BF16 needs two A100 80GB GPUs (~$3.50/hr combined). In INT4, it runs on a single A100 40GB (~$1.42/hr). That's a 59% cost cut with one config change.
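You can sanity-check GPU sizing like this yourself. The sketch below estimates weight memory from parameter count and precision; the 20% overhead factor is a loose assumption for framework buffers, and it ignores the KV cache, which grows with batch size and context length.

```python
def weight_memory_gb(n_params_billion: float, bits_per_weight: int,
                     overhead: float = 1.2) -> float:
    """Rough VRAM needed to hold model weights.

    `overhead` (~20%) is a loose allowance for activations and framework
    buffers; KV-cache memory is extra and scales with batch and context.
    """
    weight_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# 70B at 16-bit: ~168 GB -> two 80GB cards.
# 70B at 4-bit:  ~42 GB  -> a single 40-48GB card.
# 7B at 4-bit:   ~4 GB   -> fits a 16GB T4 with room to spare.
```

Run the numbers before renting: if the quantized model plus expected KV cache fits a cheaper tier, the table above says the cheaper tier wins.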

There's also a capacity angle. On an H100, FP16 allows about 4 concurrent users at 4K context. INT4 frees enough memory for 47 users — a 12x increase in serving capacity from the same GPU.

Accuracy tradeoff: INT8 is near-lossless for most tasks. INT4 degrades code generation noticeably (about 8 points on HumanEval). For math, knowledge, and chat tasks, degradation is minimal.

Tools: AWQ, GPTQ, FP8 (H100+), bitsandbytes (INT8), GGUF (CPU deployment).

 

2. Continuous Batching

Naive serving processes one request at a time. GPU utilization: 20–40%.

Static batching groups requests, but waits for the entire batch to finish before starting new ones. Still inefficient.

Continuous batching inserts new requests as slots open. GPU utilization jumps to 60–85%.

Anyscale measured a 23x throughput improvement using continuous batching with optimized memory management. Continuous batching alone gives an 8x improvement over naive serving.

For per-token costs: moving from single requests to batches of 32 cuts cost by about 85% with only 20% additional latency.
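A toy scheduler makes the difference visible. This is a simplified simulation (one decode token per request per step, no prefill modeled, synthetic lengths) — not how vLLM is implemented, just the scheduling idea:

```python
def static_batch_steps(lengths, batch_size):
    """Static batching: each batch occupies the GPU until its
    longest request finishes, leaving finished slots idle."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batch_steps(lengths, batch_size):
    """Continuous batching: a finished slot is refilled with a
    pending request on the very next step."""
    pending = list(lengths)
    active = []
    steps = 0
    while pending or active:
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))
        steps += 1  # one decode step for every active request
        active = [n - 1 for n in active if n > 1]
    return steps
```

With a mix of long and short requests (say, output lengths `[8, 2, 2, 2, 8, 2, 2, 2]` at batch size 4), static batching pays for the longest request in every batch, while continuous batching keeps every slot busy — the gap widens as length variance grows.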

Tools: vLLM (most widely deployed), SGLang (fastest for shared-prefix workloads), HuggingFace TGI.

 

3. Choose the Right Inference Framework

Not all frameworks perform equally. Here's how they compare on H100 at 50 concurrent requests:

| Framework | Throughput | Best For |
|---|---|---|
| vLLM | 1,850 tok/s | High-concurrency, broad model support |
| TensorRT-LLM | 2,100 tok/s | Maximum throughput, NVIDIA-only |
| SGLang | 1,920 tok/s | RAG, agentic, multi-turn workloads |

For throughput-critical workloads, a separate benchmark showed SGLang delivering ~16,200 tokens/second vs. vLLM's ~12,500 — a 29% gap that translates to roughly $15,000 in monthly GPU savings at a million requests per day.

At extreme concurrency (100 concurrent requests), vLLM scales better: 4,741 tok/s vs SGLang's 3,221.

Rule of thumb: Use SGLang for RAG pipelines and multi-turn chat. Use vLLM for high-concurrency production systems.

 

4. Spot Instances for Batch Jobs 

Cloud spot and preemptible GPU instances offer 60–90% discounts vs on-demand. 

In AWS eu-north-1, H100 Spot pricing fell from $105.20/hr in January 2024 to $12.16/hr by September 2025 — an 88% price collapse.

The key: spot instances are for non-interactive work. Training, fine-tuning, evaluations, and offline batch jobs are all good candidates. Don't run real-time inference on spot — an interruption drops requests. 

Reserved instances also help for predictable production load. One-year commitments save 30–60% vs on-demand.
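The practical requirement for running batch jobs on spot is resumability: a preemption should cost you the in-flight item, not the whole job. Here is a minimal checkpointing sketch — the `progress.json` filename and the `(item_id, payload)` shape are illustrative assumptions; production jobs would checkpoint to durable storage like S3.

```python
import json
import os

CHECKPOINT = "progress.json"  # hypothetical path; use durable storage in prod

def load_done():
    """Return the set of item IDs already completed."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return set(json.load(f))
    return set()

def run_batch(items, process):
    """Process (item_id, payload) pairs, recording progress after each
    item so a rerun after spot preemption skips finished work."""
    done = load_done()
    for item_id, payload in items:
        if item_id in done:
            continue  # already handled before the interruption
        process(payload)
        done.add(item_id)
        with open(CHECKPOINT, "w") as f:
            json.dump(sorted(done), f)
```

If `process` is idempotent, restarting the job after an interruption is safe by construction — which is exactly the property that makes the 60–90% spot discount usable.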

 

5. Semantic Caching and Model Routing

About 31% of enterprise LLM queries are semantically similar to previous ones. Semantic caching detects near-duplicate queries and serves cached responses — no inference needed.

Care Access implemented prompt caching on Amazon Bedrock for medical records. Result: 86% cost reduction, 66% faster processing.

Model routing is the other half. Route simple queries to a cheap 7B model ($0.06/M tokens). Save the expensive 70B for complex tasks. One team cut their monthly bill from $48,000 to $28,000 — a 42% reduction with no quality change.

Together, caching and routing can eliminate 50–86% of costs on workloads with repetitive patterns.
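The core of a semantic cache is a similarity check against previously answered queries. A minimal sketch, assuming you supply an embedding function — the toy `embed` and the 0.92 threshold are placeholders; production systems use a real embedding model and a vector index rather than a linear scan:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Serve cached responses for near-duplicate queries.

    `embed` maps a query string to a vector; `threshold` controls how
    similar a query must be to reuse a cached answer. Both are
    assumptions to tune per workload.
    """
    def __init__(self, embed, threshold=0.92):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, query):
        q = self.embed(query)
        for emb, resp in self.entries:
            if cosine(q, emb) >= self.threshold:
                return resp  # cache hit: no inference call needed
        return None

    def put(self, query, response):
        self.entries.append((self.embed(query), response))
```

Check `get` before every model call and `put` after every miss; with ~31% of queries semantically repeated, the hit rate pays for the embedding call many times over.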

 

7 Companies That Cut LLM Costs — And by How Much

Salesforce AI Research: Switched inference to Together AI. Result: 2x latency reduction, ~33% cost reduction.

Cursor: Serves 400M+ daily code completions via Together AI. Achieved ~30% cost savings with a 2x latency improvement, applying quantization with no measurable accuracy loss on coding tasks.

Convirza: Moved from Longformer to fine-tuned Llama-3-8B via Predibase multi-LoRA. Result: 10x cost reduction vs OpenAI, 80% throughput increase, 8% F1 improvement.

Checkr: Fine-tuned Llama-3-8B for background check classification. Result: 5x cost reduction vs GPT-4, 30x speedup, 90% accuracy on hard cases.

Care Access: Applied prompt caching for medical records. Result: 86% cost reduction, 66% faster processing.

Anyscale vs Amazon Bedrock: Llama 3.1 8B FP8 on Anyscale cost 2.9x less than Bedrock. Llama 3.1 70B FP8 with 80% shared prefix was 22% cheaper than Bedrock.

Enterprise baseline (anonymous): Applied quantization + autoscaling + caching together. Monthly bill: $100,000 → $45,000. A 55% reduction with no quality change.

 

The Decision Framework

Use this to pick the right setup for your workload.

Step 1: Classify your workload

| Workload | Input | Output | Best GPU |
|---|---|---|---|
| Summarization / RAG | Long (2K+) | Short (<50) | H100 or L40S for prefill |
| Chatbot / conversational | Short–Medium | Long (100–500) | A100 for decode |
| Code completion | Medium | Medium | A100 or L40S |
| Offline batch processing | Any | Any | Spot + T4/L4 + quantization |
| Ultra-low latency (<100ms) | Short | Short | H100 or Groq LPU |
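The classification step can live in code at your request gateway. This sketch mirrors the workload table; the token thresholds and return strings are illustrative, not benchmarked cutoffs:

```python
def recommend_gpu(input_tokens: int, output_tokens: int,
                  latency_slo_ms: int = 1000, batch: bool = False) -> str:
    """Map a workload profile to a GPU class.

    Thresholds are illustrative defaults -- calibrate against your own
    traffic before wiring this into a router.
    """
    if batch:
        return "spot T4/L4 + quantization"     # offline: cheapest wins
    if latency_slo_ms < 100:
        return "H100 (or Groq LPU)"            # ultra-low latency
    if input_tokens >= 2000 and output_tokens < 50:
        return "H100 or L40S (prefill-heavy)"  # summarization / RAG
    if output_tokens >= 100:
        return "A100 (decode-heavy)"           # chatbot / conversational
    return "A100 or L40S"                      # code completion, mixed
```

Even a crude classifier like this beats sending everything to the most expensive tier, which is the default most teams never revisit.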

Step 2: Match model size to GPU

| Model | Precision | Minimum GPU | Cost-Optimal GPU |
|---|---|---|---|
| 7B | BF16 | T4 (16GB) | L4 (24GB) |
| 7B | INT4 | CPU | T4 |
| 13B | BF16 | 2x T4 | A10G |
| 70B | BF16 | 2x A100 80GB | 1x A100 80GB (INT4) |
| 70B | INT4 | 1x A100 40GB | 1x A100 40GB |

 

Step 3: Ask the right questions

Is GPU utilization below 70%? You're on the wrong tier. Downgrade. 

Do you need sub-200ms time-to-first-token? Use H100 or Groq.

Is traffic bursty? Use serverless (Together AI, Fireworks, Modal) or autoscaling.

Do you have batch jobs running on on-demand instances? Move them to Spot immediately.

 

KPIs to Track

| Metric | What It Tells You | Target |
|---|---|---|
| Cost per million tokens | Primary unit economics | $0.06–$2.00 for open models |
| GPU utilization | Are you wasting capacity? | >70% |
| Time to first token (TTFT) | Perceived latency | <200ms for interactive apps |
| Tokens per second | GPU productivity | >1,000 on H100 with batching |
| Requests per GPU-dollar | Overall efficiency | Benchmark across configs |

If GPU utilization is below 40%, you're burning 60% of your GPU budget. Fix that first.
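Cost per million tokens falls straight out of two numbers you already track. A minimal calculator, assuming sustained throughput (real fleets should use utilization-weighted averages):

```python
def cost_per_million_tokens(gpu_cost_per_hr: float,
                            tokens_per_second: float) -> float:
    """Unit economics: GPU $/hr divided by hourly token throughput,
    scaled to a per-million-token figure."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hr / tokens_per_hour * 1e6

# An H100 at $2.74/hr sustaining 1,850 tok/s with batching works out
# to roughly $0.41 per million tokens -- well inside the target range.
```

Run it per GPU pool and per model: any configuration landing above the $0.06–$2.00 band for open models is the first place to apply the techniques above.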

 

Common Pitfalls

  • Over-relying on H100. Most models under 70B don't need it. An L40S or quantized A100 delivers 70–80% of H100 throughput at 35–50% of the cost.
  • Ignoring idle time. At startups, 30–50% of GPU costs come from instances left running idle. Autoscaling is not optional.
  • Quantizing without testing. INT4 hurts code generation (8 points on HumanEval). It's near-lossless for math and chat. Always test on your specific task first.
  • Using static batching in production. Any system still on static batching is leaving 70–80% of throughput behind. Switch to vLLM or SGLang now.
  • Locking to one cloud region. GPU spot prices vary 2–5x across regions. Automation tools like Cast.AI handle regional arbitrage and save significant money.

 

The 3-Phase Implementation Plan

Phase 1: Quick Wins

  1. Apply quantization. Switch to AWQ or FP8. One config change. Expect 60–75% VRAM reduction.
  2. Enable continuous batching. Deploy vLLM or SGLang. Expect 8–23x throughput improvement.
  3. Audit GPU utilization. Use nvidia-smi. If compute or memory is below 60%, downgrade your GPU tier.

 

Phase 2: Infrastructure Optimization 

  1. Add semantic caching. Target the ~31% of repeated queries.
  2. Implement model routing. Send simple tasks to 7B–8B models.
  3. Move batch jobs to Spot. Use Anyscale or Together Batch API — 50% cheaper.

 

Phase 3: Advanced Architecture

  1. Adopt heterogeneous GPU clusters. Optimize composition, configuration, and workload routing together. Expect 25–41% throughput gains at the same budget.
  2. Add prefill-decode disaggregation. Route compute-heavy prefill to H100, memory-heavy decode to A100/L40S.
  3. Evaluate reserved capacity. One-year commitments save 30–60% for steady-state production.

Frequently Asked Questions

Do I need to buy different GPUs to implement heterogeneous serving?

No. You use cloud instances with different GPU types. Configure your serving system to route requests to the right pool. vLLM and SGLang both support multi-GPU configurations that make this manageable.

How much accuracy do I lose with INT4 quantization?

It depends on the task. For math and knowledge tasks, the loss is minimal — 0.5–1.5 points on benchmarks. For code generation, expect about an 8-point drop on HumanEval. Always benchmark your specific use case before deploying.

What's the fastest way to cut my LLM inference bill right now?

Apply quantization (AWQ or INT8) and switch to continuous batching via vLLM. These two changes alone can cut costs by 60–85% within a week.

Should I use a managed provider like Together AI or Fireworks, or self-host?

For under ~100M tokens/month, managed providers are usually cheaper — no GPU rental, no engineering overhead. Above that, self-hosting with quantization and vLLM typically beats provider pricing by 2–3x.

What is time to first token and why does it matter?

Time to first token (TTFT) is how long it takes to receive the first output token after sending a request. For interactive apps, this is what users feel as lag. Target under 200ms for chat. For batch jobs, TTFT doesn't matter — throughput does.