Your LLM inference bill should be 60–80% lower than it is.
That's not a guess. It's what the numbers show. One enterprise went from $100,000 to $45,000 a month. Checkr cut costs to one-fifth. Care Access reduced its bill by 86%.
They all used the same playbook. This guide walks you through it.
What Is Heterogeneous GPU Serving?
Most teams run a single GPU type across their entire inference cluster. They pick H100s, rent a fleet, and run everything through them.
That's expensive. And it ignores how LLM workloads actually behave.
Different requests have very different demands. A summarization task might send 2,000 tokens in and get 20 back. A chatbot might send 300 tokens in and generate 500. These two workloads are almost opposites in what they need from hardware.
Heterogeneous GPU serving means using multiple GPU types in the same cluster — matching each request type to the hardware it actually needs.
The landmark paper on this is "Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs" (Jiang et al., ICML 2025). It studied three levers:
- GPU composition — which GPU types to rent, and how many of each
- Deployment configuration — how to partition the model across GPUs
- Workload assignment — which requests go to which GPU pool
Disabling any single lever causes throughput to drop by 27–34%. All three working together beat Helix — the previous state of the art — by 25–35% on the same $15/hour budget.
The best part: you don't need to buy new hardware. It's a scheduling and configuration problem, not a procurement one.
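In code, the composition lever is just a constrained search. Here's a minimal brute-force sketch, using made-up per-GPU cost and throughput numbers (illustrative placeholders, not benchmarks) and the paper's $15/hour budget:

```python
from itertools import product

# Hypothetical per-GPU stats: (cost in $/hr, throughput in requests/s for one
# workload). Illustrative numbers, not benchmarks.
GPUS = {
    "H100": (2.99, 10.0),
    "A100": (1.76, 6.2),
    "L40S": (1.80, 4.5),
    "L4":   (0.65, 1.8),
}

def best_composition(budget_per_hour, max_per_type=8):
    """Brute-force the GPU mix that maximizes throughput under an hourly budget."""
    best_tput, best_mix = 0.0, {}
    for combo in product(range(max_per_type + 1), repeat=len(GPUS)):
        cost = sum(n * GPUS[g][0] for n, g in zip(combo, GPUS))
        if cost > budget_per_hour:
            continue
        tput = sum(n * GPUS[g][1] for n, g in zip(combo, GPUS))
        if tput > best_tput:
            best_tput = tput
            best_mix = {g: n for g, n in zip(GPUS, combo) if n}
    return best_tput, best_mix

tput, mix = best_composition(15.0)  # the $15/hour budget from the paper
```

With these toy numbers the winner is a mixed fleet rather than H100-only, which is the paper's core observation. Real systems replace the throughput constants with measured per-workload profiles and search far larger spaces.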
The GPU Landscape: What Things Actually Cost
Here's what you're working with in 2026:
| GPU | VRAM | On-Demand Cost | Best For |
|---|---|---|---|
| NVIDIA T4 | 16GB | $0.35–0.50/hr | 7B models, budget workloads |
| NVIDIA L4 | 24GB | $0.50–0.80/hr | 7B–14B models, efficient inference |
| NVIDIA L40S | 48GB | ~$1.80/hr | 30B–40B models, mid-tier production |
| NVIDIA A100 40GB | 40GB | ~$1.42/hr | Large model inference |
| NVIDIA A100 80GB | 80GB | ~$1.76/hr | 70B models in FP16 |
| NVIDIA H100 SXM | 80GB | $2.74–3.90/hr | Frontier models, ultra-low latency |
| NVIDIA H200 | 141GB | ~$5.00/hr | Very long context, 70B+ single GPU |
Notice the gap. An H100 costs 7x more than a T4. For a quantized 7B model, an L4 delivers latency within 20–30% of H100 — at one-fifth the price.
H100 prices have also dropped fast. They fell 64–75% from their peak. Hyperbolic now offers H100 at $1.49/hr. But even at that rate, cheaper GPUs win for most workloads.
The rule: Only use H100 when you need sub-100ms latency or are running unquantized 70B+ models. Everything else can run on A100, L40S, or lower.
5 Techniques That Actually Reduce Costs
1. Model Quantization
Quantization reduces the numerical precision of model weights. Instead of 16-bit float (FP16), you use 8-bit integer (INT8) or 4-bit integer (INT4).
The result: smaller models that fit on cheaper hardware.
| Precision | Memory Reduction | Cost Impact |
|---|---|---|
| FP16 → INT8 | ~50% | 2–4x cheaper hardware |
| FP16 → INT4 | ~75% | 4–8x cheaper hardware |
| FP16 → FP8 | ~50% | 1.5x throughput gain (H100+) |
LLaMA-70B in BF16 needs two A100 80GB GPUs (~$3.50/hr combined). In INT4, it runs on a single A100 40GB (~$1.42/hr). That's a 59% cost cut with one config change.
There's also a capacity angle. On an H100, FP16 allows about 4 concurrent users at 4K context. INT4 frees enough memory for 47 users — a 12x increase in serving capacity from the same GPU.
Accuracy tradeoff: INT8 is near-lossless for most tasks. INT4 degrades code generation noticeably (about 8 points on HumanEval). For math, knowledge, and chat tasks, degradation is minimal.
Tools: AWQ, GPTQ, FP8 (H100+), bitsandbytes (INT8), GGUF (CPU deployment).
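You can sanity-check the memory math yourself. The estimator below uses illustrative 70B-class dimensions (80 layers, 8 KV heads with grouped-query attention, head dim 128); real capacity depends on the exact model and serving engine, so the user counts won't match the article's 4-vs-47 figures exactly:

```python
def weight_gb(params_b, bits):
    """Weight memory in GB for params_b billion parameters at a given precision."""
    return params_b * 1e9 * bits / 8 / 1e9

def kv_cache_gb_per_user(layers, kv_heads, head_dim, ctx_len, kv_bits=16):
    """KV cache for one sequence: 2 tensors (K and V) per layer."""
    return 2 * layers * kv_heads * head_dim * ctx_len * kv_bits / 8 / 1e9

VRAM_GB = 80  # A100/H100 80GB class
kv = kv_cache_gb_per_user(layers=80, kv_heads=8, head_dim=128, ctx_len=4096)

for bits in (16, 4):
    w = weight_gb(70, bits)
    users = max(0, int((VRAM_GB - w) / kv))  # VRAM left over funds the KV cache
    print(f"{bits}-bit weights: {w:.0f} GB -> ~{users} concurrent 4K-context users")
```

With these dimensions, 16-bit weights (140 GB) don't fit on one 80GB card at all, while 4-bit weights (35 GB) leave room for dozens of concurrent 4K-context users. That's the capacity angle in miniature.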
2. Continuous Batching
Naive serving processes one request at a time. GPU utilization: 20–40%.
Static batching groups requests, but waits for the entire batch to finish before starting new ones. Still inefficient.
Continuous batching inserts new requests as slots open. GPU utilization jumps to 60–85%.
Anyscale measured a 23x throughput improvement using continuous batching with optimized memory management. Continuous batching alone gives an 8x improvement over naive serving.
For per-token costs: moving from single requests to batches of 32 cuts cost by about 85% with only 20% additional latency.
Tools: vLLM (most widely deployed), SGLang (fastest for shared-prefix workloads), HuggingFace TGI.
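The difference is easy to see in a toy simulation. This sketch counts only decode steps and ignores prefill and memory; engines like vLLM are far more sophisticated, but the scheduling idea is the same:

```python
def static_batching(jobs, batch_size):
    """Fixed batches: each batch runs until its longest job finishes."""
    steps = 0
    for i in range(0, len(jobs), batch_size):
        steps += max(jobs[i:i + batch_size])
    return steps

def continuous_batching(jobs, batch_size):
    """Refill a freed slot immediately from the queue."""
    queue = list(jobs)
    slots = [queue.pop(0) for _ in range(min(batch_size, len(queue)))]
    steps = 0
    while slots:
        steps += 1
        slots = [s - 1 for s in slots if s > 1]  # finished jobs drop out
        while queue and len(slots) < batch_size:
            slots.append(queue.pop(0))
    return steps

jobs = [10, 1, 1, 5, 2, 8, 3]  # decode lengths per request, in tokens
```

With these lengths and a batch size of 2, static batching takes 26 steps and continuous batching takes 17: the short requests no longer wait for a long neighbor to finish.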
3. Choose the Right Inference Framework
Not all frameworks perform equally. Here's how they compare on H100 at 50 concurrent requests:
| Framework | Throughput | Best For |
|---|---|---|
| vLLM | 1,850 tok/s | High-concurrency, broad model support |
| TensorRT-LLM | 2,100 tok/s | Maximum throughput, NVIDIA-only |
| SGLang | 1,920 tok/s | RAG, agentic, multi-turn workloads |
For throughput-critical workloads: SGLang delivers ~16,200 tokens/second vs. vLLM's ~12,500. That 29% difference translates to roughly $15,000 in monthly GPU savings at a million requests per day.
At extreme concurrency (100 concurrent requests), vLLM scales better: 4,741 tok/s vs SGLang's 3,221.
Rule of thumb: Use SGLang for RAG pipelines and multi-turn chat. Use vLLM for high-concurrency production systems.
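To see how a throughput delta becomes a dollar figure, here's the back-of-envelope math as code. The traffic shape and the $2.99/hr H100 rate are illustrative assumptions, not the benchmark's exact setup:

```python
import math

def monthly_gpu_cost(req_per_day, tokens_per_req, tok_per_s_per_gpu, gpu_hourly):
    """GPUs needed to sustain the average token rate, priced over ~730 hours/month."""
    tokens_per_s = req_per_day * tokens_per_req / 86_400
    gpus = math.ceil(tokens_per_s / tok_per_s_per_gpu)
    return gpus * gpu_hourly * 730

# Illustrative traffic: 10M requests/day, 2,000 tokens each, H100 at $2.99/hr.
slow = monthly_gpu_cost(10_000_000, 2_000, 12_500, 2.99)  # vLLM-like figure above
fast = monthly_gpu_cost(10_000_000, 2_000, 16_200, 2.99)  # SGLang-like figure above
```

Under these assumptions the faster engine needs 15 GPUs instead of 19, roughly $8,700/month less. The exact savings depend entirely on your traffic shape, which is why the article's $15,000 figure and this sketch differ.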
4. Spot Instances for Batch Jobs
Cloud spot and preemptible GPU instances offer 60–90% discounts vs on-demand.
In AWS eu-north-1, H100 Spot pricing fell from $105.20/hr in January 2024 to $12.16/hr by September 2025 — an 88% price collapse.
The key: spot instances are for non-interactive work. Training, fine-tuning, evaluations, and offline batch jobs are all good candidates. Don't run real-time inference on spot — an interruption drops requests.
Reserved instances also help for predictable production load. One-year commitments save 30–60% vs on-demand.
5. Semantic Caching and Model Routing
About 31% of enterprise LLM queries are semantically similar to previous ones. Semantic caching detects near-duplicate queries and serves cached responses — no inference needed.
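A minimal semantic cache looks like this. The bag-of-words "embedding" is a stand-in to keep the sketch self-contained; a production system would use a real sentence-embedding model and a vector index:

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words counts. Swap in a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, cached response)

    def get(self, query):
        """Return a cached response if any stored query is similar enough."""
        q = embed(query)
        for emb, response in self.entries:
            if cosine(q, emb) >= self.threshold:
                return response
        return None

    def put(self, query, response):
        self.entries.append((embed(query), response))
```

A near-duplicate query ("how do I reset my password please" vs "how do I reset my password") scores above the threshold and skips inference entirely; an unrelated query misses and falls through to the model.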
Care Access implemented prompt caching on Amazon Bedrock for medical records. Result: 86% cost reduction, 66% faster processing.
Model routing is the other half. Route simple queries to a cheap 7B model ($0.06/M tokens). Save the expensive 70B for complex tasks. One team cut their monthly bill from $48,000 to $28,000 — a 42% reduction with no quality change.
Together, caching and routing can eliminate 50–80% of costs on workloads with repetitive patterns.
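Routing can start as a simple heuristic. The prices, model names, and keyword rule below are hypothetical placeholders; production routers typically use a trained classifier or confidence scores from the small model:

```python
# Hypothetical per-model prices; the routing rule is deliberately naive.
SMALL = {"name": "llama-3-8b", "cost_per_mtok": 0.06}
LARGE = {"name": "llama-3-70b", "cost_per_mtok": 0.90}

HARD_HINTS = ("prove", "refactor", "multi-step", "legal", "diagnose")

def route(prompt):
    """Send long or complexity-hinting prompts to the large model, the rest to the small one."""
    hard = len(prompt.split()) > 200 or any(h in prompt.lower() for h in HARD_HINTS)
    return LARGE if hard else SMALL
```

Even this crude rule captures the economics: every request the router keeps on the small model costs 15x less per token in this example.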
7 Companies That Cut LLM Costs — And by How Much
Salesforce AI Research: Switched inference to Together AI. Result: latency halved, ~33% cost reduction.
Cursor: Serves 400M+ daily code completions via Together AI. Achieved ~30% cost savings while halving latency, with quantization applied at no accuracy loss on coding tasks.
Convirza: Moved from Longformer to fine-tuned Llama-3-8B via Predibase multi-LoRA. Result: 10x cost reduction vs OpenAI, 80% throughput increase, 8% F1 improvement.
Checkr: Fine-tuned Llama-3-8B for background check classification. Result: 5x cost reduction vs GPT-4, 30x speedup, 90% accuracy on hard cases.
Care Access: Applied prompt caching for medical records. Result: 86% cost reduction, 66% faster processing.
Anyscale vs Amazon Bedrock: Llama 3.1 8B FP8 on Anyscale cost 2.9x less than Bedrock. Llama 3.1 70B FP8 with 80% shared prefix was 22% cheaper than Bedrock.
Enterprise baseline (anonymous): Applied quantization + autoscaling + caching together. Monthly bill: $100,000 → $45,000. A 55% reduction with no quality change.
The Decision Framework
Use this to pick the right setup for your workload.
Step 1: Classify your workload
| Workload | Input | Output | Best GPU |
|---|---|---|---|
| Summarization / RAG | Long (2K+) | Short (<50) | H100 or L40S for prefill |
| Chatbot / conversational | Short–Medium | Long (100–500) | A100 for decode |
| Code completion | Medium | Medium | A100 or L40S |
| Offline batch processing | Any | Any | Spot + T4/L4 + quantization |
| Ultra-low latency (<100ms) | Short | Short | H100 or Groq LPU |
Step 2: Match model size to GPU
| Model | Precision | Minimum GPU | Cost-Optimal GPU |
|---|---|---|---|
| 7B | BF16 | T4 (16GB) | L4 (24GB) |
| 7B | INT4 | CPU | T4 |
| 13B | BF16 | 2x T4 | A10G |
| 70B | BF16 | 2x A100 80GB | 1x A100 80GB (INT4) |
| 70B | INT4 | 1x A100 40GB | 1x A100 40GB |
Step 3: Ask the right questions
- Is GPU utilization below 70%? You're on the wrong tier. Downgrade.
- Do you need sub-200ms time-to-first-token? Use H100 or Groq.
- Is traffic bursty? Use serverless (Together AI, Fireworks, Modal) or autoscaling.
- Do you have batch jobs running on on-demand instances? Move them to Spot immediately.
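Those questions translate directly into a checklist function. The thresholds follow the text above; the action strings are placeholders for your own runbook:

```python
def recommend(utilization, ttft_target_ms, bursty, batch_on_on_demand):
    """Encode the Step 3 checklist: each question maps to one action."""
    actions = []
    if utilization < 0.70:
        actions.append("downgrade GPU tier")
    if ttft_target_ms < 200:
        actions.append("use H100-class hardware or Groq")
    if bursty:
        actions.append("use serverless or autoscaling")
    if batch_on_on_demand:
        actions.append("move batch jobs to Spot")
    return actions
```

A cluster at 50% utilization with batch jobs on on-demand instances gets exactly two actions back: downgrade the tier, move the batch jobs.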
KPIs to Track
| Metric | What it Tells You | Target |
|---|---|---|
| Cost per million tokens | Primary unit economics | $0.06–$2.00 for open models |
| GPU utilization | Are you wasting capacity? | >70% |
| Time to first token (TTFT) | Perceived latency | <200ms for interactive apps |
| Tokens per second | GPU productivity | >1,000 on H100 with batching |
| Requests per GPU-dollar | Overall efficiency | Benchmark across configs |
If GPU utilization is below 40%, at least 60% of your GPU budget is paying for idle hardware. Fix that first.
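Cost per million tokens falls straight out of hourly price and sustained throughput. A quick calculator, using an assumed $2.99/hr H100 rate and the 1,850 tok/s vLLM figure from earlier as inputs:

```python
def cost_per_million_tokens(gpu_hourly, tokens_per_second, utilization=1.0):
    """$ per 1M generated tokens on one GPU at a given sustained throughput."""
    effective_tps = tokens_per_second * utilization
    return gpu_hourly / (effective_tps * 3600) * 1_000_000

full = cost_per_million_tokens(2.99, 1850)        # fully utilized: ~$0.45/Mtok
idle = cost_per_million_tokens(2.99, 1850, 0.40)  # at 40% utilization: ~$1.12/Mtok
```

The same GPU at 40% utilization costs 2.5x more per token than at full utilization, which is why utilization sits next to cost per million tokens in the KPI table.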
Common Pitfalls
- Over-relying on H100. Most models under 70B don't need it. An L40S or quantized A100 delivers 70–80% of H100 throughput at 35–50% of the cost.
- Ignoring idle time. At startups, 30–50% of GPU costs come from instances left running idle. Autoscaling is not optional.
- Quantizing without testing. INT4 hurts code generation (8 points on HumanEval). It's near-lossless for math and chat. Always test on your specific task first.
- Using static batching in production. Any system still on static batching is leaving 70–80% of throughput behind. Switch to vLLM or SGLang now.
- Locking to one cloud region. GPU spot prices vary 2–5x across regions. Automation tools like Cast.AI handle regional arbitrage and save significant money.
The 3-Phase Implementation Plan
Phase 1: Quick Wins
- Apply quantization. Switch to AWQ or FP8. One config change. Expect 60–75% VRAM reduction.
- Enable continuous batching. Deploy vLLM or SGLang. Expect 8–23x throughput improvement.
- Audit GPU utilization. Use nvidia-smi. If compute or memory is below 60%, downgrade your GPU tier.
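The utilization audit can be scripted. The nvidia-smi query flags in the comment are real; the parsing and the 60% threshold are a sketch, shown here against sample output rather than a live GPU:

```python
# Run this on the host to get per-GPU numbers (real nvidia-smi flags):
#   nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total \
#              --format=csv,noheader,nounits

def parse_smi(csv_text):
    """Parse the CSV output above into per-GPU dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        util, used, total = (float(x) for x in line.split(","))
        gpus.append({"util_pct": util, "mem_pct": 100 * used / total})
    return gpus

def audit(gpus, threshold=60.0):
    """Flag GPU indices where both compute and memory sit below the threshold."""
    return [i for i, g in enumerate(gpus)
            if g["util_pct"] < threshold and g["mem_pct"] < threshold]

sample = "23, 10240, 81920\n85, 70000, 81920\n"  # sample output, not live data
flagged = audit(parse_smi(sample))  # GPU 0 is a downgrade candidate
```

Run this on a cron schedule and GPUs that idle below the threshold for days show up automatically, instead of surfacing in the monthly bill.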
Phase 2: Infrastructure Optimization
- Add semantic caching. Target the ~31% of repeated queries.
- Implement model routing. Send simple tasks to 7B–8B models.
- Move batch jobs to Spot. Use Anyscale or Together Batch API — 50% cheaper.
Phase 3: Advanced Architecture
- Adopt heterogeneous GPU clusters. Optimize composition, configuration, and workload routing together. Expect 25–41% throughput gains at the same budget.
- Add prefill-decode disaggregation. Route compute-heavy prefill to H100, memory-heavy decode to A100/L40S.
- Evaluate reserved capacity. One-year commitments save 30–60% for steady-state production.