RAG Total Cost Calculator

All-in-one RAG bill — embedding pass + vector DB + reranker + LLM generation. Plug in document count and query volume to see the full monthly stack.

The AITOT RAG Total Cost calculator estimates monthly cost for a full RAG application stack — embedding (one-time + recurring), vector DB storage + queries, optional reranker, and LLM generation. Inputs: corpus size, chunks per doc, queries per day, chunks retrieved per query, generation tokens.

A typical knowledge-base RAG with 1M docs, 10k queries/day, and the reranker on costs about $160/month: $40 vector DB + $30 reranker + $90 LLM generation. Generation dominates at high query volume; the vector DB dominates with a large corpus and low query volume. The calculator shows the split for your specific scale.

Toggle prompt caching to cut generation cost 50–90%. For stable system prompts (typically 4–8k tokens), real-world steady-state cache hit rates are 70–85%. With the reranker on, Cohere Rerank 3 at $1/1k searches improves answer quality 15–30% by re-scoring 50 retrieved chunks down to the top 5.
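As a rough illustration of the caching claim, the sketch below scales input spend by the cache hit rate. It assumes cached input tokens are billed at a discounted rate, that output tokens are unaffected, and that there is no cache-write surcharge. The function name and the $100 example spend are hypothetical, and the calculator's own formula may differ.

```python
def cached_input_cost(input_cost: float, hit_rate: float, discount: float) -> float:
    """Estimate input spend after prompt caching.

    input_cost: monthly input-token spend with no caching (USD)
    hit_rate:   fraction of input tokens served from cache (0..1)
    discount:   fractional price cut on cached tokens (0.90 = 90% off)
    """
    if not (0.0 <= hit_rate <= 1.0 and 0.0 <= discount <= 1.0):
        raise ValueError("hit_rate and discount must be between 0 and 1")
    # Cached tokens pay (1 - discount); uncached tokens pay full price.
    return input_cost * (1.0 - hit_rate * discount)


# Hypothetical $100 of monthly input spend, 70% hit rate, 90% discount.
print(round(cached_input_cost(100.0, 0.70, 0.90), 2))  # 37.0
```

Because only the input side is discounted in this model, the saving on the whole generation bill depends on how much of that bill is input tokens.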

Total monthly

$913

One-time embed cost

$6

Per query

$0.0061

Year 1 total

$10,956

Monthly cost breakdown

Embedding query (Voyage AI voyage-3): $0 (0%)
Re-embed refresh (0.25×/mo): $2 (0%)
Vector DB (Pinecone Serverless (s1)): $3 (0%)
Reranker (Cohere Rerank 3): $300 (33%)
Generation (Anthropic Claude Haiku 4.5): $608 (67%)

RAG bill = embedding query + vector DB storage/retrieval + reranker (optional) + LLM generation. Generation typically dominates above 50k queries/day. At MVP scale, vector DB minimums dominate.
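The formula above is a plain sum, so the displayed total and percentages can be checked by hand. A minimal sketch, using the five line items from the breakdown above; the dictionary keys and function name are illustrative, not the calculator's internals.

```python
# Line items copied from the monthly cost breakdown (USD per month).
line_items = {
    "Embedding query": 0,
    "Re-embed refresh": 2,
    "Vector DB": 3,
    "Reranker": 300,
    "Generation": 608,
}


def total_and_shares(items: dict[str, float]) -> tuple[float, dict[str, int]]:
    """Return the monthly total and each item's share as a whole percent."""
    total = sum(items.values())
    if total <= 0:
        raise ValueError("total must be positive to compute shares")
    shares = {name: round(100 * cost / total) for name, cost in items.items()}
    return total, shares


total, shares = total_and_shares(line_items)
print(total)                 # 913
print(shares["Reranker"])    # 33
print(shares["Generation"])  # 67
```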

What this calculator does

Full RAG stack

Embedding + vector DB + reranker + generation all in one bill.

Per-component breakdown

See exactly which line item is the biggest contributor at your scale.

Reranker toggle

Cohere Rerank 3 modeling. Adds $0.001/query but improves answer quality 15–30%.

Prompt cache modeling

Stable system prompts get 70–85% cache hits — toggle to see real cost.

Per-query cost

Surfaces $ per RAG query — critical for unit economics and pricing the product.

Chunk strategy modeling

Toggle chunks per doc and chunks retrieved per query to optimize cost.
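The per-query and yearly figures follow directly from the monthly total. The sketch below reproduces the displayed $0.0061 and $10,956 under two assumptions: the year-one figure is twelve times the recurring monthly bill, and the example run covers about 150,000 queries per month. The query count is inferred from the displayed numbers, not stated on the page.

```python
def per_query_cost(monthly_total: float, queries_per_month: int) -> float:
    """Average cost of one RAG query in USD."""
    if queries_per_month <= 0:
        raise ValueError("queries_per_month must be positive")
    return monthly_total / queries_per_month


def annual_recurring(monthly_total: float) -> float:
    """Twelve months of the recurring bill (one-time embedding excluded)."""
    return 12 * monthly_total


print(annual_recurring(913))                   # 10956
print(round(per_query_cost(913, 150_000), 4))  # 0.0061
```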

Quick comparison

Monthly RAG cost at 1M docs, 10k queries/day (typical knowledge-base app)

Component | Provider | Monthly
Embed (one-time amortized) | OpenAI 3-small | $5
Vector DB (10M chunks) | Pinecone Serverless | $40
Reranker (300k queries) | Cohere Rerank 3 | $30
Generation (Sonnet 4.6) | Anthropic | $90
Generation w/ 70% cache hit | Anthropic | $28
Total with cache + rerank | | $103 / mo

Without prompt caching, generation alone is $90+. Cache is the single biggest lever.
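To make the effect of caching on the whole stack concrete, here is a small sketch that sums the table rows with and without the cached generation figure. The values are taken from the table; the variable names are illustrative.

```python
# Rows that do not change with caching (USD per month, from the table).
fixed_rows = {"embed_amortized": 5, "vector_db": 40, "reranker": 30}

# Generation row, without caching and at a 70% cache hit rate.
generation = {"no_cache": 90, "cache_70": 28}


def stack_total(generation_key: str) -> int:
    """Sum the fixed rows plus the chosen generation row."""
    return sum(fixed_rows.values()) + generation[generation_key]


print(stack_total("no_cache"))  # 165
print(stack_total("cache_70"))  # 103
print(stack_total("no_cache") - stack_total("cache_70"))  # 62
```

The $160 figure quoted elsewhere on this page is the uncached total without the amortized embed row ($40 + $30 + $90).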

How to use this calculator

Calculate full RAG stack monthly cost — embed + vector DB + reranker + generation.

  1. Enter corpus + chunks

    Documents × chunks per doc. Typical: 1 doc = 5–20 chunks at 500 tokens each.

  2. Set query volume

    Queries per day. Most production apps cache 30–50% of queries before reaching the LLM.

  3. Toggle reranker

    Cohere Rerank 3 adds $0.001/query but improves quality 15–30%. Usually worth it.

  4. Set prompt cache hit rate

    Stable system prompts hit 70–85%. Cuts generation cost 50–90% on Anthropic.
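The first two steps reduce to a few multiplications. The sketch below shows how the inputs could be turned into chunk, token, and query counts. The class and field names are hypothetical, a 30-day month is assumed, and the 40% query cache rate is simply a point inside the 30–50% range mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RagInputs:
    documents: int
    chunks_per_doc: int
    tokens_per_chunk: int
    queries_per_day: int
    query_cache_rate: float = 0.0  # share of queries answered before the LLM
    days_per_month: int = 30       # assumption: flat 30-day month

    @property
    def total_chunks(self) -> int:
        return self.documents * self.chunks_per_doc

    @property
    def corpus_tokens(self) -> int:
        return self.total_chunks * self.tokens_per_chunk

    @property
    def monthly_queries(self) -> int:
        return self.queries_per_day * self.days_per_month

    @property
    def llm_queries(self) -> int:
        # Queries that miss the query cache and reach the LLM.
        return round(self.monthly_queries * (1.0 - self.query_cache_rate))


example = RagInputs(
    documents=1_000_000,
    chunks_per_doc=10,
    tokens_per_chunk=500,
    queries_per_day=10_000,
    query_cache_rate=0.40,
)
print(example.total_chunks)     # 10000000
print(example.corpus_tokens)    # 5000000000
print(example.monthly_queries)  # 300000
print(example.llm_queries)      # 180000
```

These counts are the quantities that the embedding, vector DB, reranker, and generation prices are then applied to.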

Why use this calculator

  • Full stack, not just the LLM piece
  • Reranker toggle
  • Prompt cache modeling
  • Per-query unit economics
  • 9 vector DB + 22 LLM providers
  • No login required

Frequently Asked Questions

What does a typical RAG application cost per month in 2026?
For 1M documents and 10k queries/day with the reranker on: about $40 vector DB + $30 reranker + $90 LLM generation = $160/month total. Add a $15 one-time embedding pass for the corpus. Without the reranker, the total drops to $130/month. The calculator builds this estimate component by component.
How do I split RAG cost between embedding, vector DB, and generation?
For a typical knowledge-base RAG: embedding 5% one-time, vector DB 25% recurring, generation 60% recurring, reranker 10% if used. Generation dominates at high query volume; the vector DB dominates with a large corpus and low query volume. The calculator shows the split for your specific scale.
Should I use a reranker in my RAG pipeline?
Yes, if precision matters more than roughly 200 ms of added latency. Cohere Rerank 3 at $1/1k searches typically improves answer quality 15–30% by re-scoring 50 retrieved chunks down to the top 5. For chat UX, the latency tax is worth it. For batch RAG (overnight reports), always rerank.
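For budgeting, the reranker line is a simple volume-times-price product. A minimal sketch, assuming each query triggers exactly one billed search; the function name is hypothetical. Figures elsewhere on this page may rest on a different rerank volume or price, so both are parameters here.

```python
def reranker_monthly_cost(queries_per_month: int,
                          price_per_1k_searches: float = 1.0) -> float:
    """Monthly reranker spend in USD, assuming one billed search per query."""
    if queries_per_month < 0 or price_per_1k_searches < 0:
        raise ValueError("inputs must be non-negative")
    return queries_per_month / 1000 * price_per_1k_searches


# 10k queries/day over an assumed 30-day month, at the quoted $1 per 1k searches.
print(reranker_monthly_cost(10_000 * 30))  # 300.0
```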
How many chunks should I retrieve per RAG query?
Retrieve 20–50 chunks, rerank down to 5–10, pass to the LLM. Retrieving fewer than 10 chunks risks missing the answer; passing more than 10 to the LLM bloats input cost and dilutes attention. The calculator multiplies retrieved-chunks × tokens-per-chunk into generation cost.
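The input-token effect of passing fewer chunks to the LLM can be sketched as follows. The function name is hypothetical, and the 500 tokens per chunk and 4,000-token fixed prompt are illustrative values drawn from ranges quoted on this page.

```python
def generation_input_tokens(chunks_to_llm: int,
                            tokens_per_chunk: int,
                            fixed_prompt_tokens: int = 0) -> int:
    """Input tokens per query: retrieved context plus the fixed prompt."""
    if min(chunks_to_llm, tokens_per_chunk, fixed_prompt_tokens) < 0:
        raise ValueError("inputs must be non-negative")
    return chunks_to_llm * tokens_per_chunk + fixed_prompt_tokens


# Rerank 50 retrieved chunks down to 5 before generation...
reranked = generation_input_tokens(5, 500, fixed_prompt_tokens=4_000)
# ...versus passing all 50 retrieved chunks straight to the LLM.
unranked = generation_input_tokens(50, 500, fixed_prompt_tokens=4_000)

print(reranked, unranked)  # 6500 29000
```

In this example, passing all 50 chunks makes the prompt roughly 4.5 times larger per query, which is the input-cost bloat the answer above warns about.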
Does prompt caching help RAG cost a lot?
Massively. If your system prompt + few-shot examples are stable (typical 4–8k tokens), prompt cache hits cut Anthropic input cost 90%, OpenAI 50%, Google 75%. Real-world steady-state RAG cache hit rate is 70–85%. Toggle the slider to see your bill drop.
When is RAG cheaper than fine-tuning?
Below 10M monthly tokens or when knowledge changes weekly, RAG wins. Above 50M tokens with stable knowledge that fits the prompt, fine-tuning a smaller model often beats RAG total cost by 2–5×. Most production apps stay on RAG due to operational simplicity.