Cost & FinOps Estimator

GenAI economics behave differently from cloud-compute economics, and most FinOps frameworks were built for the latter. Drag the sliders and watch token economics, caching, and model routing shape the bill. The numbers update as fast as you can move the controls. When you're done, optionally generate an LLM-written executive summary with action items.

// Use case (optional — shown in the LLM summary)
// Inputs
Monthly query volume 100,000
Log scale · 1k → 100M queries / month
Avg input tokens / call 1,200 tok
Includes system prompt + retrieved context
Avg output tokens / call 400 tok
Completion size after structure / formatting
Cache hit rate 30%
Identical-input cache; semantic cache adds more
// Model tier mix · must total 100%
Frontier 20%
Workhorse 60%
Open-weight 20%
// Model rates (USD per 1M tokens · click to edit)
tier          input ($/1M)   output ($/1M)
Frontier      —              —
Workhorse     —              —
Open-weight   —              —

Defaults reflect public Q1 2026 list pricing for OpenAI/Anthropic frontier models, Kimi K2 / Claude Haiku class workhorses, and self-hosted Llama-3-class open-weight models on a managed inference stack at ~60% utilization.

// Live output
MONTHLY COST
$0
— · — / call
ANNUAL
$0
PER 1K CALLS
$0
EFFECTIVE TOKENS / MO
0
CACHE SAVINGS / YR
$0
// Spend by model tier
Frontier
$0
Workhorse
$0
Open-weight
$0
CACHING IMPACT
SELF-HOSTING BREAK-EVEN
// Sensitivity · monthly cost at varying cache rate × frontier-mix
10% frontier 30% frontier 60% frontier
0% cache
30% cache
60% cache
One LLM call synthesizes an executive summary from the current numbers.
// FinOps recommendations
Click the button above to have the model interpret your numbers and write an executive summary.
How it's calculated
For each model tier t in {frontier, workhorse, open}:
   tier_calls = monthly_volume × mix[t]
   per_call_cost[t] = (input_tokens × rate_in[t] + output_tokens × rate_out[t]) / 1,000,000
   tier_monthly_cost = tier_calls × per_call_cost[t] × (1 − cache_hit_rate)

monthly_total = Σ tier_monthly_cost
annual_total = monthly_total × 12

cache_savings_per_year = annual_total × cache_hit_rate / (1 − cache_hit_rate)
   (i.e. what you'd be paying if cache were disabled, minus what you pay now)

break_even_volume:
   self_host_monthly = gpu_count × $2,400/GPU   (assuming an A100/H100-class GPU at
                                                 ~50,000 throughput-tokens/sec
                                                 × 60% utilization ≈ ~78B tokens/month)
   solve for volume: blended_cloud_per_call × volume = self_host_monthly
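The pseudocode above maps onto a few lines of TypeScript. A minimal sketch — the names, the default rates, and the $2,400/GPU figure mirror the formulas, not the app's actual source:

```typescript
type Tier = "frontier" | "workhorse" | "open";

interface Inputs {
  monthlyVolume: number;         // calls / month
  inputTokens: number;           // avg input tokens / call
  outputTokens: number;          // avg output tokens / call
  cacheHitRate: number;          // 0..1, identical-input cache
  mix: Record<Tier, number>;     // fractions, must sum to 1
  rateIn: Record<Tier, number>;  // USD per 1M input tokens
  rateOut: Record<Tier, number>; // USD per 1M output tokens
}

const TIERS: Tier[] = ["frontier", "workhorse", "open"];

// per_call_cost[t] from the pseudocode
function perCallCost(t: Tier, p: Inputs): number {
  return (p.inputTokens * p.rateIn[t] + p.outputTokens * p.rateOut[t]) / 1_000_000;
}

// monthly_total = Σ tier_calls × per_call_cost × (1 − cache_hit_rate)
function monthlyTotal(p: Inputs): number {
  return TIERS.reduce(
    (sum, t) =>
      sum + p.monthlyVolume * p.mix[t] * perCallCost(t, p) * (1 - p.cacheHitRate),
    0,
  );
}

// What you'd pay with the cache disabled, minus what you pay now
function annualCacheSavings(p: Inputs): number {
  return (monthlyTotal(p) * 12 * p.cacheHitRate) / (1 - p.cacheHitRate);
}

// Volume at which blended cloud cost equals a flat self-hosting bill
function breakEvenVolume(p: Inputs, gpuCount: number, gpuMonthlyUsd = 2_400): number {
  const blendedPerCall = monthlyTotal(p) / p.monthlyVolume; // cache-adjusted $/call
  return (gpuCount * gpuMonthlyUsd) / blendedPerCall;
}

// Default sliders with placeholder rates — override with contract pricing:
const demo: Inputs = {
  monthlyVolume: 100_000,
  inputTokens: 1_200,
  outputTokens: 400,
  cacheHitRate: 0.3,
  mix: { frontier: 0.2, workhorse: 0.6, open: 0.2 },
  rateIn: { frontier: 3, workhorse: 0.5, open: 0.2 },
  rateOut: { frontier: 15, workhorse: 1.5, open: 0.6 },
};

console.log(`monthly: $${monthlyTotal(demo).toFixed(2)}`);
console.log(`cache savings / yr: $${annualCacheSavings(demo).toFixed(2)}`);
```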

All math runs in the browser — no upstream model call for the calculator. The LLM is only called when you press Generate FinOps recommendations; it receives the current numbers and writes a paragraph or two.

Why these levers
  • Volume — the multiplier. Doubling traffic doubles cost: per-token pricing has no built-in economies of scale. Cutting calls via deduplication, batching windows, or scheduling cuts cost in direct proportion.
  • Input tokens — system prompt + retrieved context. Often the biggest controllable lever: trim system prompts, return smaller chunks from retrieval, compress conversation history.
  • Output tokens — completion size. Constrain it with max_tokens, JSON response modes (e.g. response_format), and structured-output schemas; verbose prose costs more than terse JSON.
  • Cache hit rate — usually the highest-leverage lever in real production traffic. Identical inputs return for free; semantic caching extends this further with care.
  • Model tier mix — most enterprise queries don't need the frontier model. Routing simple cases to a workhorse or open-weight model preserves quality where it matters and cuts cost everywhere else.
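The tier-mix lever is easy to make concrete with a small what-if. The per-call figures below are placeholders (roughly 1,200 in / 400 out tokens at illustrative rates), not list prices:

```typescript
// What-if: route 30 percentage points of traffic from frontier to workhorse.
const frontierPerCall = 0.0096;  // hypothetical $/call on the frontier tier
const workhorsePerCall = 0.0012; // hypothetical $/call on the workhorse tier
const volume = 100_000;          // calls / month

function monthlyCost(frontierShare: number): number {
  return (
    volume *
    (frontierShare * frontierPerCall + (1 - frontierShare) * workhorsePerCall)
  );
}

const before = monthlyCost(0.5); // 50% frontier
const after = monthlyCost(0.2);  // 20% frontier after routing
console.log(`saves $${(before - after).toFixed(2)} / month`);
```

Because the frontier tier here costs 8× the workhorse per call, a 30-point shift removes nearly half the bill — routing dominates most other optimizations at realistic rate spreads.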
Honest caveat

This is a planning tool, not a billing system. Real bills include: long-context price tiers, output-tokens rate variations on some providers, dedicated-throughput discounts, prompt-caching credits (Anthropic, OpenAI), batch-API discounts (~50%), egress, observability and gateway costs.

Use the numbers here to scope conversations with vendors and finance, not to commit to a number. The model rate inputs are editable so you can override with your actual contract pricing.