AI Inference Benchmark & Cost
Benchmark inference speed and cost per million tokens across hardware (H100, A100, consumer GPUs) and models.
The AITOT Inference Benchmark calculator estimates tokens/second and cost per million output tokens for self-hosted inference on H100, H200, B200, A100, and RTX 5090 — running Llama 4, Qwen 3, Mistral, DeepSeek, and other open-weight models with vLLM, TGI, or SGLang.
An H100 runs Llama 4 70B at ~95 tokens/sec single-stream and ~380 tokens/sec at batch=8. With speculative decoding using Llama 4 8B as the draft model, single-stream throughput rises to ~140 tokens/sec. On average, H100 is about 1.7× faster than A100, and H200 cuts TTFT by roughly 40% vs H100 due to higher memory bandwidth.
Cost per million output tokens drops sharply with batching: single-stream H100 + Llama 4 70B costs $0.45/M, while batch=8 costs $0.12/M (almost 4× cheaper). Use the calculator to see when self-hosting beats hosted APIs at your specific volume and batch concurrency.
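The self-host vs hosted API comparison can be sketched as a break-even calculation. This is a minimal illustration, not the calculator's actual logic: the function names, the always-on GPU assumption, the 730-hour month, and the example hourly rate are all hypothetical.

```python
HOURS_PER_MONTH = 730  # assumption: average month length in hours


def self_host_monthly_cost(gpu_hourly_usd: float, gpu_count: int = 1) -> float:
    """Fixed monthly cost of keeping the GPUs rented around the clock."""
    return gpu_hourly_usd * gpu_count * HOURS_PER_MONTH


def break_even_tokens_millions(
    gpu_hourly_usd: float, api_usd_per_m: float, gpu_count: int = 1
) -> float:
    """Monthly output volume (millions of tokens) above which an always-on
    self-hosted deployment costs less than paying the hosted API price."""
    return self_host_monthly_cost(gpu_hourly_usd, gpu_count) / api_usd_per_m


def max_monthly_tokens_millions(tokens_per_sec: float) -> float:
    """Upper bound on monthly output at a sustained aggregate throughput."""
    return tokens_per_sec * 3600 * HOURS_PER_MONTH / 1e6


# Hypothetical inputs: a $2.00/hr GPU vs a $0.60/M hosted price.
needed = break_even_tokens_millions(gpu_hourly_usd=2.00, api_usd_per_m=0.60)
capacity = max_monthly_tokens_millions(tokens_per_sec=380)
print(f"break-even volume: {needed:.0f}M tokens/mo, capacity: {capacity:.0f}M tokens/mo")
print("self-hosting can win" if capacity >= needed else "self-hosting cannot win at this throughput")
```

Comparing the break-even volume against what the hardware can physically produce is the useful part: if the required volume exceeds capacity at a given batch concurrency, only higher batching or a lower hourly rate closes the gap.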
Cheapest: DeepInfra ($69.00/mo)
Fastest: SambaNova (580 tok/s)
| Host | Tokens/sec | TTFT | Response time | $ / 1M out | Total / mo |
|---|---|---|---|---|---|
| DeepInfra | 70 | 410 ms | 7.55 s | $0.60 | $69.00 |
| SambaNova | 580 | 110 ms | 0.97 s | $0.60 | $90.00 |
| Groq | 320 | 180 ms | 1.74 s | $0.79 | $98.50 |
| Cerebras | 450 | 120 ms | 1.23 s | $0.85 | $107.50 |
| Together | 92 | 320 ms | 5.75 s | $0.88 | $132.00 |
| Fireworks | 110 | 290 ms | 4.84 s | $0.90 | $135.00 |
| Self-host (H100 SXM ×4, vLLM; AWS p5 spot reference) | 85 | 380 ms | 6.26 s | $1.95 | $292.50 |
| Self-host (B200 ×4) | 165 | 220 ms | 3.25 s | $2.10 | $315.00 |
Numbers are for batch=1 streaming decode (chat UX). Production back-end batches can hit 5–20× higher aggregate tokens/sec at the same per-token cost. Cross-check against artificialanalysis.ai for the latest figures.
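The Response time column is consistent with TTFT plus the time needed to stream a fixed-length reply. A minimal sketch, assuming a 500-token response: that length is inferred from the table rows rather than stated on the page, and the function name is illustrative.

```python
def response_time_s(tokens_per_sec: float, ttft_ms: float, output_tokens: int = 500) -> float:
    """End-to-end time for a streamed reply: time to first token plus decode time."""
    return ttft_ms / 1000 + output_tokens / tokens_per_sec


# (tokens/sec, TTFT in ms) copied from three rows of the table above.
rows = {"DeepInfra": (70, 410), "SambaNova": (580, 110), "Groq": (320, 180)}
for host, (tps, ttft) in rows.items():
    print(f"{host}: {response_time_s(tps, ttft):.2f} s")
```

At low throughput the decode term dominates, which is why a 410 ms TTFT matters far less for DeepInfra's total than its 70 tokens/sec.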
What this calculator does
- Tokens/sec for top models: Llama 4 8B/70B/405B, Qwen 3, DeepSeek V3, Mistral Large, and GPT-OSS are all benchmarked.
- Batch size modeling: See throughput scaling from batch=1 to batch=32 with continuous batching.
- TTFT estimates: Time-to-first-token modeling, which is critical for chat UX.
- Speculative decoding: Toggle to see 1.5–2× speedup with draft-model speculation.
- Cost per 1M output: GPU rental cost ÷ throughput = real $/M output tokens.
- vLLM, TGI, SGLang: Engine-specific overhead accounted for; vLLM is typically fastest for throughput.
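The "Cost per 1M output" rule above (GPU rental cost ÷ throughput) can be written out as a short sketch. The hourly rate below is a placeholder rather than a quoted price, and the function name is illustrative.

```python
def cost_per_million_tokens(
    gpu_hourly_usd: float, tokens_per_sec: float, gpu_count: int = 1
) -> float:
    """Dollars per one million output tokens at a sustained aggregate throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_hourly_usd * gpu_count / tokens_per_hour * 1e6


rate = 2.00  # hypothetical $/hr for one GPU
single_stream = cost_per_million_tokens(rate, tokens_per_sec=95)
batched = cost_per_million_tokens(rate, tokens_per_sec=380)
print(f"single-stream: ${single_stream:.2f}/M, batch=8: ${batched:.2f}/M")
print(f"batching makes output {single_stream / batched:.1f}x cheaper")
```

Because the hourly rate cancels out of the ratio, the saving from batching depends only on how much aggregate throughput rises.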
Quick comparison
Llama 4 70B inference performance by GPU (vLLM, batch=8)
| GPU | Tokens/sec | TTFT | Cost/M out |
|---|---|---|---|
| RTX 5090 32GB (quant) | 110 | 420 ms | $0.08 |
| A100 80GB | 210 | 180 ms | $0.18 |
| H100 80GB | 380 | 95 ms | $0.12 |
| H100 SXM + spec. | 540 | 90 ms | $0.09 |
| H200 141GB | 480 | 60 ms | $0.10 |
| B200 | 760 | 40 ms | $0.08 |
Cost assumes RunPod community pricing; vLLM batched at 8 concurrent requests.
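Relative figures between GPUs can be derived directly from the rows above. A small sketch using the batch=8 numbers from the table; the dictionary layout and helper names are illustrative.

```python
# Rows copied from the quick comparison table (vLLM, batch=8).
gpus = {
    "A100 80GB": {"tok_s": 210, "ttft_ms": 180},
    "H100 80GB": {"tok_s": 380, "ttft_ms": 95},
    "H200 141GB": {"tok_s": 480, "ttft_ms": 60},
    "B200": {"tok_s": 760, "ttft_ms": 40},
}


def speedup(gpu: str, baseline: str) -> float:
    """Throughput of `gpu` as a multiple of `baseline`."""
    return gpus[gpu]["tok_s"] / gpus[baseline]["tok_s"]


def ttft_reduction_pct(gpu: str, baseline: str) -> float:
    """Percentage by which `gpu` lowers TTFT relative to `baseline`."""
    return (1 - gpus[gpu]["ttft_ms"] / gpus[baseline]["ttft_ms"]) * 100


print(f"H100 vs A100 throughput: {speedup('H100 80GB', 'A100 80GB'):.2f}x")
print(f"H200 vs H100 TTFT reduction: {ttft_reduction_pct('H200 141GB', 'H100 80GB'):.1f}%")
```

These ratios are specific to this one model and batch size; averages across models and settings can differ.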
How to use this calculator
Estimate tokens/sec and cost per million tokens for self-hosted LLM inference.
1. Pick a model: Choose Llama 4 8B/70B/405B, Qwen 3, DeepSeek V3, or Mistral. The tool flags VRAM-incompatible GPU pairs.
2. Pick a GPU: H100 is the workhorse. H200 or B200 for highest throughput. RTX 5090 for cheap dev.
3. Set batch concurrency: Batch=8 is the typical production sweet spot. Higher batches save cost but raise latency.
4. Enable speculative decoding: If you have a small draft model, toggle it on for 1.5–2× speedup at the same accuracy.
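Step 1 mentions that the tool flags VRAM-incompatible model and GPU pairs. The page does not describe how; a common rule of thumb is to compare weight memory (parameters × bytes per parameter) plus some headroom against total GPU memory. The sketch below uses that rule, and the 20% overhead factor and function names are assumptions.

```python
def weights_gb(params_billion: float, bits_per_param: int = 16) -> float:
    """Approximate weight memory in GB: 1e9 params × bytes per param / 1e9."""
    return params_billion * bits_per_param / 8


def fits(
    params_billion: float,
    vram_gb: float,
    gpu_count: int = 1,
    bits_per_param: int = 16,
    overhead: float = 1.2,  # assumed headroom for KV cache and activations
) -> bool:
    """Rough check that the model weights plus headroom fit in total VRAM."""
    return weights_gb(params_billion, bits_per_param) * overhead <= vram_gb * gpu_count


print(fits(8, vram_gb=32))                                     # 8B at 16-bit on one 32 GB card
print(fits(405, vram_gb=80))                                   # 405B at 16-bit on one 80 GB card
print(fits(405, vram_gb=141, gpu_count=8, bits_per_param=8))   # 405B at 8-bit on 8 × 141 GB
```

Real deployments also depend on context length and batch size, since the KV cache grows with both, so a check like this is only a first filter.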
Why use this calculator
- ✓ Benchmarks based on public vLLM + SGLang reports
- ✓ 5 GPU classes covered
- ✓ Engine overhead included
- ✓ TTFT modeled, not just throughput
- ✓ Speculative decoding included
- ✓ Refreshed monthly