Question 1

How many tokens per second can an H100 generate with Llama 4 70B?

Accepted Answer

About 95 tokens/sec single-stream, and 380 tokens/sec in aggregate when batched at 8 concurrent requests using vLLM. With speculative decoding using Llama 4 8B as the draft model, single-stream throughput reaches about 140 tokens/sec. Time-to-first-token is typically 280 ms cold and 95 ms warm.
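To get a feel for where a speculative-decoding gain like 95 to 140 tokens/sec (about 1.47×) could come from, here is a minimal sketch of a commonly used idealized model. It assumes each drafted token is accepted independently with a fixed probability, which real workloads only approximate. The function names and all parameter values are illustrative assumptions, not figures from this page or from vLLM.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Idealized assumption: every drafted token is accepted independently
    with probability accept_rate, and each pass always yields at least
    one token from the target model.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)


def estimated_speedup(accept_rate: float, draft_len: int,
                      draft_cost_ratio: float) -> float:
    """Estimated speedup over plain decoding.

    draft_cost_ratio is (time of one draft-model step) divided by
    (time of one target-model step); it is a hypothetical input here.
    """
    time_per_pass = draft_len * draft_cost_ratio + 1
    return expected_tokens_per_pass(accept_rate, draft_len) / time_per_pass


if __name__ == "__main__":
    # Hypothetical values, chosen only to illustrate the shape of the model.
    print(f"{estimated_speedup(0.6, 4, 0.1):.2f}x")
```

In this model the speedup drops quickly as the acceptance rate falls or the draft model gets more expensive, which is one reason a small draft model from the same family is a typical choice.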

Question 2

H100 vs A100 — what is the actual inference speedup in 2026?

Accepted Answer

For Llama 4 70B at FP16, the H100 runs about 1.7× faster than the A100 (95 vs 56 tokens/sec). For long-context workloads (>32k tokens), the gap widens to 2.4×, helped by the H100's higher memory bandwidth (3.35 TB/s vs 2.04 TB/s). The A100 still wins on $/token for legacy workloads.

Question 3

What is TTFT and why does it matter?

Accepted Answer

Time-to-first-token (TTFT) is how long the user waits before the first token of the response appears. It matters most for chat UX: above 1 second, the interface feels broken. Prompt caching and prefix sharing reduce TTFT by skipping repeated prefill work, while speculative decoding mainly speeds up the tokens that follow the first. H200 and B200 cut TTFT by about 40% vs H100.
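TTFT is straightforward to measure yourself against any streaming endpoint. The sketch below times an arbitrary iterable of tokens. `measure_ttft` and `fake_stream` are hypothetical helpers written for this example, and `fake_stream` only simulates a server with sleeps; swap in whatever streaming iterator your client library returns.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def measure_ttft(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float, int]:
    """Return (ttft_seconds, total_seconds, token_count) for a token stream.

    The timer starts when this function is called, so call it right as the
    request is sent. ttft_seconds is None if the stream yields nothing.
    """
    start = clock()
    ttft: Optional[float] = None
    count = 0
    for _ in stream:
        if ttft is None:
            ttft = clock() - start  # first token has arrived
        count += 1
    total = clock() - start
    return ttft, total, count


def fake_stream(first_delay: float, gap: float, n_tokens: int) -> Iterator[str]:
    """Hypothetical stand-in for a streaming client (simulated with sleeps)."""
    time.sleep(first_delay)  # simulated prefill and queueing
    for i in range(n_tokens):
        if i > 0:
            time.sleep(gap)  # simulated per-token decode time
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, total, n = measure_ttft(fake_stream(0.095, 0.01, 20))
    print(f"TTFT {ttft * 1000:.0f} ms, {n} tokens in {total:.2f} s")
```

Measuring on the client side includes network and queueing time, which is what the user actually experiences, so expect it to read higher than server-reported prefill latency.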

Question 4

How does batching affect cost per million tokens?

Accepted Answer

Single-stream, an H100 running Llama 4 70B costs about $0.45/M output tokens. At batch=8, that drops to $0.12/M output tokens (almost 4× cheaper). vLLM, TGI, and SGLang all support continuous batching. The calculator models batch sizes of 1, 4, 8, and 16.
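The reason batching cuts cost is plain arithmetic: the GPU costs the same per hour, so cost per token scales inversely with aggregate throughput. A minimal sketch, using the 95 and 380 tokens/sec figures quoted on this page; the hourly price is a placeholder assumption (the page does not state one) and it cancels out of the ratio.

```python
def cost_per_million_tokens(gpu_dollars_per_hour: float,
                            tokens_per_sec: float) -> float:
    """Cost in dollars to generate one million output tokens."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000


# Placeholder hourly price: an assumption for illustration only.
HOURLY_PRICE = 1.0

single_stream = cost_per_million_tokens(HOURLY_PRICE, 95)   # batch=1
batched = cost_per_million_tokens(HOURLY_PRICE, 380)        # batch=8
ratio = single_stream / batched

if __name__ == "__main__":
    print(f"batch=8 is {ratio:.1f}x cheaper per token than batch=1")
```

The ratio depends only on the two throughput numbers, which is why the quoted prices ($0.45 vs $0.12) land close to the 4× throughput gain.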

Question 5

Is a consumer GPU (RTX 4090, 5090) viable for inference?

Accepted Answer

Yes, for models up to 30B parameters quantized to int4. An RTX 5090 (32GB) runs Llama 4 8B at 180 tokens/sec for under $0.05/hour in amortized electricity. 70B+ models are not viable without 4-bit quantization plus offloading. It is a cheap path for development and side projects.

Question 6

What inference engine should I use in 2026 — vLLM, TGI, or SGLang?

Accepted Answer

vLLM has the best continuous batching and prompt caching. SGLang wins for structured output and complex prompts. TGI, maintained by Hugging Face, is the most production-hardened. For pure throughput, use vLLM. For latency-sensitive chat, use SGLang. The calculator assumes vLLM defaults.

Host	Tokens/sec	TTFT	Response time	$ / 1M out	Total / mo
DeepInfra	70	410 ms	7.55 s	$0.60	$69.00
SambaNova	580	110 ms	0.97 s	$0.60	$90.00
Groq	320	180 ms	1.74 s	$0.79	$98.50
Cerebras	450	120 ms	1.23 s	$0.85	$107.50
Together	92	320 ms	5.75 s	$0.88	$132.00
Fireworks	110	290 ms	4.84 s	$0.90	$135.00
Self-host (H100 SXM ×4, vLLM; AWS p5 spot reference)	85	380 ms	6.26 s	$1.95	$292.50
Self-host (B200 ×4)	165	220 ms	3.25 s	$2.10	$315.00
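The Response time column in the table above appears to be TTFT plus the time to stream a fixed-length reply at the listed throughput. The sketch below reproduces it. The 500-token reply length is inferred from the numbers in the table, not stated anywhere on the page, so treat it as an assumption.

```python
def response_time_s(ttft_ms: float, tokens_per_sec: float,
                    output_tokens: int = 500) -> float:
    """End-to-end response time: TTFT plus decode time for the reply.

    output_tokens defaults to 500, a value inferred from the table
    rather than documented.
    """
    return ttft_ms / 1000 + output_tokens / tokens_per_sec


# host: (tokens/sec, TTFT in ms, response time listed in the table)
ROWS = {
    "DeepInfra": (70, 410, 7.55),
    "SambaNova": (580, 110, 0.97),
    "Groq": (320, 180, 1.74),
    "Cerebras": (450, 120, 1.23),
    "Together": (92, 320, 5.75),
    "Fireworks": (110, 290, 4.84),
    "Self-host (H100 SXM x4)": (85, 380, 6.26),
    "Self-host (B200 x4)": (165, 220, 3.25),
}

if __name__ == "__main__":
    for host, (tps, ttft_ms, listed) in ROWS.items():
        est = response_time_s(ttft_ms, tps)
        print(f"{host}: estimated {est:.2f} s, table lists {listed:.2f} s")
```

For slow hosts the decode term dominates, so throughput matters far more than TTFT for long replies; for the fastest hosts TTFT is a noticeable share of the total.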

GPU	Tokens/sec	TTFT	Cost/M out
RTX 5090 32GB (quant)	110	420ms	$0.08
A100 80GB	210	180ms	$0.18
H100 80GB	380	95ms	$0.12
H100 SXM + spec.	540	90ms	$0.09
H200 141GB	480	60ms	$0.10
B200	760	40ms	$0.08

AI Inference Benchmark & Cost

What this calculator does

Tokens/sec for top models

Batch size modeling

TTFT estimates

Speculative decoding

Cost per 1M output

vLLM, TGI, SGLang
