AI Inference Benchmark & Cost

Benchmark inference speed and cost per million tokens across hardware (H100, A100, consumer GPUs) and models.

The AITOT Inference Benchmark calculator estimates tokens/second and cost per million output tokens for self-hosted inference on H100, H200, B200, A100, and RTX 5090 — running Llama 4, Qwen 3, Mistral, DeepSeek, and other open-weight models with vLLM, TGI, or SGLang.

An H100 runs Llama 4 70B at ~95 tokens/sec single-stream, 380 tokens/sec at batch=8. With speculative decoding using Llama 4 8B as the draft model, single-stream pushes to ~140 tokens/sec. H100 is 1.7× faster than A100 on average; H200 cuts TTFT by 40% vs H100 due to higher memory bandwidth.

Cost per million output tokens drops dramatically with batching — single-stream H100 + Llama 4 70B is $0.45/M; batch=8 drops to $0.12/M (almost 4× cheaper). Use the calculator to see when self-hosting beats hosted APIs at your specific volume + batch concurrency.
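The break-even question can be sketched in a few lines. This is a minimal illustration, not the calculator's own logic: it assumes one GPU rented around the clock, a flat hosted price per million output tokens, and 730 hours per month. The function names, the $2.00/hour and $0.50/hour rates, and the $0.60/M API price are placeholders chosen for the example.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month


def break_even_tokens_millions(gpu_hourly_usd, api_usd_per_million):
    """Monthly volume (millions of tokens) at which GPU rental equals API spend."""
    return gpu_hourly_usd * HOURS_PER_MONTH / api_usd_per_million


def monthly_capacity_millions(tokens_per_sec):
    """Most tokens (in millions) one GPU can emit in a month at a sustained rate."""
    return tokens_per_sec * 3600 * HOURS_PER_MONTH / 1e6


def self_host_wins(monthly_tokens_millions, gpu_hourly_usd,
                   api_usd_per_million, tokens_per_sec):
    """True when the volume fits on one GPU and the API bill would exceed the rental."""
    fits = monthly_tokens_millions <= monthly_capacity_millions(tokens_per_sec)
    api_cost = monthly_tokens_millions * api_usd_per_million
    gpu_cost = gpu_hourly_usd * HOURS_PER_MONTH
    return fits and api_cost > gpu_cost


# Placeholder rates; 380 tokens/sec is the batch=8 figure quoted above.
print(self_host_wins(500, 2.00, 0.60, 380))  # False: API bill is below the rental
print(self_host_wins(900, 0.50, 0.60, 380))  # True: volume fits and API costs more
```

The capacity check matters: if the break-even volume exceeds what the GPU can produce in a month, self-hosting on that GPU never catches up at those prices.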

Cheapest: DeepInfra, $69.00/mo

Fastest: SambaNova, 580 tok/s

Host | Tokens/sec | TTFT | Response time | $ / 1M out | Total / mo
DeepInfra | 70 | 410 ms | 7.55 s | $0.60 | $69.00
SambaNova | 580 | 110 ms | 0.97 s | $0.60 | $90.00
Groq | 320 | 180 ms | 1.74 s | $0.79 | $98.50
Cerebras | 450 | 120 ms | 1.23 s | $0.85 | $107.50
Together | 92 | 320 ms | 5.75 s | $0.88 | $132.00
Fireworks | 110 | 290 ms | 4.84 s | $0.90 | $135.00
Self-host (H100 SXM ×4, vLLM; AWS p5 spot reference) | 85 | 380 ms | 6.26 s | $1.95 | $292.50
Self-host (B200 ×4) | 165 | 220 ms | 3.25 s | $2.10 | $315.00

Numbers are batch=1 streaming-decode (chat UX). Production back-end batches can hit 5–20× higher tokens/sec at the same per-token cost. Cross-check against artificialanalysis.ai for the latest.
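The response-time column in the table above appears to follow a simple model: TTFT plus decode time for a fixed-length reply. The sketch below assumes 500 output tokens per response. That figure is inferred from the table rows, not stated by the page, and the function name is illustrative.

```python
def response_time_s(ttft_ms, tokens_per_sec, output_tokens=500):
    """Estimated end-to-end time: wait for the first token, then stream the rest.

    output_tokens=500 is an inferred assumption, not a documented setting.
    """
    return ttft_ms / 1000 + output_tokens / tokens_per_sec


print(round(response_time_s(410, 70), 2))   # 7.55 (DeepInfra row)
print(round(response_time_s(110, 580), 2))  # 0.97 (SambaNova row)
```

Under this model, decode speed dominates for long replies and TTFT dominates for short ones, so the ranking of hosts can change with reply length.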

What this calculator does

Tokens/sec for top models

Llama 4 8B/70B/405B, Qwen 3, DeepSeek V3, Mistral Large, GPT-OSS — all benchmarked.

Batch size modeling

See throughput scaling from batch=1 to batch=32 with continuous batching.

TTFT estimates

Time-to-first-token modeling — critical for chat UX.

Speculative decoding

Toggle to see 1.5–2× speedup with draft-model speculation.

Cost per 1M output

GPU rental cost ÷ throughput = real $/M output tokens.

vLLM, TGI, SGLang

Engine-specific overhead accounted for; vLLM is typically fastest for throughput.
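The "GPU rental cost ÷ throughput" rule listed above can be written out as follows. This is a minimal sketch that assumes the GPU is fully utilized for the whole hour. The $2.00/hour rate is a placeholder for illustration, not the calculator's pricing input, so the dollar figures it prints are not the ones quoted on this page.

```python
def cost_per_million_usd(gpu_hourly_usd, tokens_per_sec):
    """Dollars per one million output tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


hourly = 2.00  # placeholder rental rate in USD/hour
single = cost_per_million_usd(hourly, 95)    # single-stream throughput
batched = cost_per_million_usd(hourly, 380)  # batch=8 throughput

print(f"{single:.2f}")   # 5.85
print(f"{batched:.2f}")  # 1.46
```

Because the hourly rate is fixed, cost per token scales inversely with throughput: four times the tokens per second means a quarter of the cost per million.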

Quick comparison

Llama 4 70B inference performance by GPU (vLLM, batch=8)

GPU | Tokens/sec | TTFT | Cost/M out
RTX 5090 32GB (quant) | 110 | 420 ms | $0.08
A100 80GB | 210 | 180 ms | $0.18
H100 80GB | 380 | 95 ms | $0.12
H100 SXM + spec. | 540 | 90 ms | $0.09
H200 141GB | 480 | 60 ms | $0.10
B200 | 760 | 40 ms | $0.08

Cost assumes RunPod community pricing; vLLM batched at 8 concurrent requests.

How to use this calculator

Estimate tokens/sec and cost per million tokens for self-hosted LLM inference.

  1. Pick a model

    Choose Llama 4 8B/70B/405B, Qwen 3, DeepSeek V3, or Mistral. The tool flags VRAM-incompatible GPU pairs.

  2. Pick a GPU

    H100 is the workhorse. H200 or B200 for highest throughput. RTX 5090 for cheap dev.

  3. Set batch concurrency

    Batch=8 is the typical production sweet spot. Higher batches save cost but raise latency.

  4. Enable speculative decoding

    If you have a small draft model, toggle on for 1.5–2× speedup at the same accuracy.
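The VRAM compatibility flag from step 1 can be approximated with a weights-only estimate. The sketch below is an assumption about how such a check might work, not the tool's actual rule: it counts model weights only and reserves a hypothetical 10% of VRAM as headroom for the KV cache and activations, which in practice can need considerably more.

```python
# Approximate storage per parameter for common precisions.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gb(params_billions, precision):
    """Weight footprint in GB: billions of parameters times bytes per parameter."""
    return params_billions * BYTES_PER_PARAM[precision]


def fits(params_billions, precision, vram_gb, headroom=0.9):
    """True if the weights fit in the usable share of VRAM (headroom is an assumption)."""
    return weights_gb(params_billions, precision) <= vram_gb * headroom


print(fits(70, "fp16", 80))   # False: 140 GB of weights on one 80 GB GPU
print(fits(70, "fp16", 320))  # True: four 80 GB GPUs pooled
print(fits(70, "int4", 32))   # False: 35 GB of weights on a 32 GB card
print(fits(8, "fp16", 32))    # True: 16 GB of weights
```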

Why use this calculator

  • Benchmarks based on public vLLM + SGLang reports
  • 5 GPU classes covered
  • Engine overhead included
  • TTFT modeled, not just throughput
  • Speculative decoding included
  • Refreshed monthly

Frequently Asked Questions

How many tokens per second can an H100 run on Llama 4 70B?
About 95 tokens/sec single-stream and 380 tokens/sec batched at 8 concurrent requests using vLLM. With speculative decoding using Llama 4 8B as the draft model, single-stream throughput reaches ~140 tokens/sec. Time-to-first-token is typically 280 ms cold and 95 ms warm.
H100 vs A100: what is the actual inference speedup in 2026?
For Llama 4 70B FP16: H100 runs ~1.7× faster than A100 (95 vs 56 tok/sec). For long-context (>32k tokens), H100 widens the gap to 2.4× due to higher memory bandwidth (3.35TB/s vs 2.04TB/s). A100 still wins on $/token for legacy workloads.
What is TTFT and why does it matter?
Time-to-first-token is how long the user waits before seeing the first character of the response. It matters most for chat UX, where anything above 1 second feels broken. Prompt caching and prefix sharing reduce TTFT, while speculative decoding mainly speeds up the tokens that follow the first one. H200 and B200 cut TTFT 40% vs H100.
How does batching affect cost per million tokens?
Single-stream H100 + Llama 4 70B costs about $0.45/M output tokens. At batch=8, it drops to $0.12/M output (almost 4× cheaper). vLLM, TGI, and SGLang all support continuous batching. The calculator models batch=1, 4, 8, 16 scenarios.
Are consumer GPUs (RTX 4090, 5090) viable for inference?
For models up to 30B parameters quantized to int4, yes. An RTX 5090 (32GB) runs Llama 4 8B at 180 tokens/sec for under $0.05/hour in amortized electricity. Models of 70B and above are not viable without 4-bit quantization plus offloading. Consumer GPUs are a cheap path for development and side projects.
What inference engine should I use in 2026: vLLM, TGI, or SGLang?
vLLM has the best continuous batching and prompt caching. SGLang wins for structured output and complex prompts. TGI (from Hugging Face) is the most production-hardened. For pure throughput, choose vLLM; for latency-sensitive chat, choose SGLang. The calculator assumes vLLM defaults.
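To compare the calculator's estimates with your own deployment, TTFT and decode rate can be measured from any streaming response. The sketch below is engine-agnostic and assumes only that the client exposes tokens as a Python iterable. `measure_stream` and `fake_stream` are illustrative names, and `fake_stream` is a stand-in for a real vLLM, TGI, or SGLang streaming client.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(tokens: Iterable[str]) -> Tuple[float, float, int]:
    """Return (ttft_seconds, decode_tokens_per_sec, token_count) for one stream."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in tokens:
        if first is None:
            first = time.perf_counter()  # moment the first token arrived
        count += 1
    end = time.perf_counter()
    if first is None:
        raise ValueError("stream produced no tokens")
    decode_time = end - first
    # Decode rate excludes the first token, whose latency is counted as TTFT.
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return first - start, rate, count


def fake_stream(n: int = 5, ttft_s: float = 0.05, gap_s: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a streaming client: one delay, then steady tokens."""
    time.sleep(ttft_s)
    yield "tok"
    for _ in range(n - 1):
        time.sleep(gap_s)
        yield "tok"


ttft, rate, count = measure_stream(fake_stream())
print(f"TTFT {ttft * 1000:.0f} ms, {rate:.0f} tok/s over {count} tokens")
```

Measuring TTFT separately from decode rate keeps the two effects apart, which is how the tables on this page report them.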