ai · cloud cost · finops · inference

Cloud cost in 2026: why your AI workloads are ten times your compute bill

Inference is the dominant line on the cloud invoice now. GPU egress is real. Reserved capacity for AI is harder to model than EC2 ever was. A worked example.

25 April 2026 · 4 min read

In Q1 2026, across six SaaS clients I run cost reviews for, the median ratio of AI-inference spend to traditional compute is 8.4x. The highest is 17x. A year ago the median was under 2x. Compute is not getting cheaper; AI is getting bigger.

This is the practical map of where the money goes and what to do about it.

The new line items

A 2024 cloud bill had a predictable shape: compute, storage, network, data transfer, managed databases. The 2026 bill adds:

  • Hosted LLM inference (Bedrock, Azure OpenAI, Vertex, plus direct provider APIs).
  • Embedding generation, often a separate line at high volume.
  • Vector storage (managed: pgvector on RDS, OpenSearch, Pinecone passthrough).
  • Evaluation runs, which are LLM calls but usually mis-tagged.
  • GPU compute for self-hosted models, where applicable.
  • Egress on AI traffic, which is non-trivial when you stream long completions across regions.

In a typical bill, the first item alone is now larger than the entire EC2 line.

Rough $/1M token costs, April 2026

Public list prices. Caching, batch, and committed-use discounts can move these 30–70%.

| Provider / model | $/1M input | $/1M output |
| ------------------------ | ---------- | ----------- |
| Gemini 2 Flash | $0.10 | $0.40 |
| Amazon Nova Lite | $0.06 | $0.24 |
| GPT-4o mini | $0.15 | $0.60 |
| Claude 4 Haiku | $0.25 | $1.25 |
| Gemini 2 Pro | $1.25 | $5.00 |
| GPT-4.x (frontier tier) | $2.50 | $10.00 |
| Claude 4 Sonnet | $3.00 | $15.00 |
| Claude 4 Opus | $15.00 | $75.00 |

Depending on the pairing, the frontier-class models run roughly 12x (Gemini 2 Pro vs. Flash) to 60x (Claude 4 Sonnet vs. Nova Lite) the workhorse tier, and Opus-class over 100x. Most production traffic does not need frontier; routing it there anyway is the most common cost mistake I see.

Worked example: a 500M-token-per-month SaaS

Assume a B2B SaaS with 500M tokens/month, 70% input / 30% output, and a tiered routing strategy:

  • 80% of calls go to a Flash-class model (Gemini 2 Flash or Nova Lite).
  • 18% go to a mid-tier model (Claude 4 Haiku or GPT-4o mini).
  • 2% go to a frontier model (Claude 4 Sonnet) for hard cases.

Math:

  • Flash tier: 400M tokens. 280M input × $0.10/1M + 120M output × $0.40/1M = $28 + $48 = $76.
  • Mid tier: 90M tokens. 63M input × $0.25/1M + 27M output × $1.25/1M = $15.75 + $33.75 = $49.50.
  • Frontier tier: 10M tokens. 7M input × $3.00/1M + 3M output × $15.00/1M = $21 + $45 = $66.

Total LLM spend: ~$192/month. Add embeddings (say 50M tokens at $0.02/1M = $1), vector storage ($150 on managed pgvector), evaluation runs ($60), and egress ($40). Roughly $440/month.

Now do the naive version: route every call to Claude 4 Sonnet. 350M input × $3.00/1M + 150M output × $15.00/1M = $1,050 + $2,250 = $3,300/month in LLM spend alone. Same product, 7.5x the bill.
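
To make the arithmetic easy to audit, here is the same calculation as a short Python script; the prices and tier shares are exactly the assumptions above, nothing more.

```python
# Cost check for the worked example. Prices are the list prices from
# the table above, in $ per 1M tokens (input, output).
PRICES = {
    "flash":    (0.10, 0.40),   # Gemini 2 Flash
    "mid":      (0.25, 1.25),   # Claude 4 Haiku
    "frontier": (3.00, 15.00),  # Claude 4 Sonnet
}
TIER_SHARE = {"flash": 0.80, "mid": 0.18, "frontier": 0.02}
TOTAL_M = 500        # 500M tokens/month
INPUT_SHARE = 0.70   # 70% input / 30% output

def tier_cost(tier: str, share: float) -> float:
    """Monthly dollars for one routing tier."""
    toks = TOTAL_M * share
    p_in, p_out = PRICES[tier]
    return toks * INPUT_SHARE * p_in + toks * (1 - INPUT_SHARE) * p_out

routed = sum(tier_cost(t, s) for t, s in TIER_SHARE.items())  # $191.50
other = 1 + 150 + 60 + 40   # embeddings, vector storage, evals, egress
naive = tier_cost("frontier", 1.0)                            # $3,300.00

print(f"routed LLM spend:  ${routed:,.2f}")            # ~$192
print(f"all-in bill:       ${routed + other:,.2f}")    # ~$440
print(f"naive Sonnet-only: ${naive:,.2f} "
      f"({naive / (routed + other):.1f}x the bill)")   # 7.5x
```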

This is why model routing is not optional. A cheap router that classifies request difficulty and dispatches accordingly typically pays for itself in week one.
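
A minimal sketch of what such a router can look like. The `classify_difficulty` heuristic here is hypothetical; in production it would be a small classifier model, a per-feature rule, or a regex over request type.

```python
from dataclasses import dataclass

@dataclass
class Route:
    model: str
    price_in: float   # $ per 1M input tokens
    price_out: float  # $ per 1M output tokens

# Tiers mirror the worked example; prices are the list prices above.
ROUTES = {
    "easy":     Route("gemini-2-flash", 0.10, 0.40),
    "moderate": Route("claude-4-haiku", 0.25, 1.25),
    "hard":     Route("claude-4-sonnet", 3.00, 15.00),
}

def classify_difficulty(prompt: str) -> str:
    """Hypothetical heuristic: reasoning-heavy or very long prompts
    go up-tier; everything else stays on the cheap tier."""
    if any(k in prompt.lower() for k in ("prove", "plan", "refactor")):
        return "hard"
    return "moderate" if len(prompt) > 2000 else "easy"

def route(prompt: str) -> Route:
    return ROUTES[classify_difficulty(prompt)]

print(route("Summarize this ticket").model)        # gemini-2-flash
print(route("Refactor the billing module").model)  # claude-4-sonnet
```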

GPU egress is real

AWS, Azure, and Google Cloud all charge egress on streaming completions, and for long-output workloads (code generation, document drafting) it adds up. Two patterns I now insist on:

  • Keep the model and the consuming service in the same region. Cross-region streaming for chat-class output costs more than people expect.
  • For very high-volume internal-only workloads, run inference inside the VPC of the consumer. Bedrock VPC endpoints, Azure private endpoints, and Vertex Private Service Connect all matter; a Bedrock sketch follows.
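
As a sketch of the AWS variant of that pattern, assuming an interface VPC endpoint already exists; the vpce hostname below is a placeholder for your own endpoint's DNS name.

```python
import boto3

# Keep the client in the same region as the consuming service, and
# point it at the interface VPC endpoint so completions never cross
# the public internet or a region boundary.
bedrock = boto3.client(
    "bedrock-runtime",
    region_name="us-east-1",
    endpoint_url="https://vpce-0123456789abcdef0-xxxxxxxx"
                 ".bedrock-runtime.us-east-1.vpce.amazonaws.com",
)

response = bedrock.converse(
    modelId="amazon.nova-lite-v1:0",  # workhorse-tier model from the table
    messages=[{"role": "user", "content": [{"text": "Summarize this ticket."}]}],
)
print(response["output"]["message"]["content"][0]["text"])
```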

Reserved capacity is harder than EC2 reservations ever were

Provisioned throughput on Bedrock, PTUs on Azure OpenAI, and committed-use discounts on Vertex all promise 25–50% savings for committed capacity. The catch:

  • Models change. A PTU you committed to last quarter may be on a deprecated model this quarter.
  • Token mix shifts. Your input/output ratio changes as products mature, and PTU pricing rewards specific shapes of traffic.
  • Multi-model routing breaks the commitment math. If you split traffic across three providers, you cannot easily commit to any one of them.

Practical rule: do not commit until you have at least three months of stable traffic, and keep commitments under 60% of baseline so a model deprecation does not strand you.
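
That rule as a back-of-envelope sizing function; the 15% stability threshold is my own assumption, not any provider's.

```python
def safe_commitment(monthly_tokens_m: list[float],
                    cap: float = 0.60,
                    min_months: int = 3,
                    max_drift: float = 0.15) -> float:
    """Tokens/month (millions) worth committing to, or 0 if traffic is
    too young or too unstable. Implements the rule above: at least
    three months of history, commit at most 60% of the recent floor."""
    if len(monthly_tokens_m) < min_months:
        return 0.0                      # not enough history to commit
    recent = monthly_tokens_m[-min_months:]
    lo, hi = min(recent), max(recent)
    if (hi - lo) / hi > max_drift:
        return 0.0                      # traffic still drifting: wait
    return cap * lo                     # 60% of the recent floor

print(safe_commitment([310, 480, 500, 495, 510]))  # -> 297.0 (60% of 495M)
```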

What FinOps for AI actually looks like

The FinOps Foundation added an AI workgroup in 2025 and the practices are stabilizing. The real ones I use with clients:

  • Tag every API call with feature, customer tier, and request type. If you cannot, your billing data is useless.
  • Set per-feature token budgets and alarm at 80%; inference cost runs away in hours, not weeks. (A sketch of this plus tagging follows the list.)
  • Cache aggressively. Prompt caching on Anthropic and OpenAI, context caching on Gemini: for any workload with repeated system prompts, this is 50%+ savings (example below).
  • Run a monthly model-mix review. The cheap-tier models keep getting better; what was Sonnet-only six months ago is now Haiku-capable.
  • Measure cost per user action, not cost per token. Tokens are a unit of supply, not demand.
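
A minimal sketch of the first two items together, per-call attribution plus an 80% budget alarm; the feature names, budgets, and alert hook are all placeholders.

```python
from collections import defaultdict

# Monthly token budgets per feature, in millions of tokens (illustrative).
BUDGETS_M = {"search": 40.0, "drafting": 120.0}
usage_m = defaultdict(float)  # keyed by (feature, customer_tier, request_type)

def record_call(feature: str, customer_tier: str, request_type: str,
                input_tokens: int, output_tokens: int) -> None:
    """Tag every call and alarm at 80% of the feature budget. The print
    is a stand-in for a real alert hook (PagerDuty, Slack, CloudWatch)."""
    key = (feature, customer_tier, request_type)
    usage_m[key] += (input_tokens + output_tokens) / 1e6
    feature_total = sum(v for (f, _, _), v in usage_m.items() if f == feature)
    if feature_total >= 0.8 * BUDGETS_M[feature]:
        print(f"ALERT {feature}: {feature_total:.1f}M "
              f"of {BUDGETS_M[feature]:.0f}M tokens used")

record_call("search", "enterprise", "chat",
            input_tokens=900, output_tokens=300)
```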
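
And the caching item on Anthropic, as a sketch: the `cache_control` block is the actual prompt-caching mechanism, while the model ID follows this post's naming and should be swapped for your provider's real one.

```python
import anthropic

# The multi-KB prefix you would otherwise resend (and pay for) on every call.
LONG_SYSTEM_PROMPT = "You are the billing assistant for ..."

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-4-haiku",  # placeholder ID; use your provider's real one
    max_tokens=512,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # cache this prefix across calls
    }],
    messages=[{"role": "user", "content": "Summarize this ticket."}],
)
print(response.content[0].text)
```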

Takeaways

  • AI inference is now the dominant cloud line item for any company using LLMs in production. Plan accordingly.
  • Tiered model routing is the single highest-leverage cost lever. A well-tuned router cuts spend 50–80%.
  • Caching is free money. Turn it on before anything else.
  • Keep inference and consumers in the same region. Egress on long completions is real.
  • Be cautious with provisioned commitments. Models deprecate faster than reservation terms.
  • Tag everything from day one. AI FinOps without per-call attribution is just guessing.

The companies winning on AI cost in 2026 are not the ones using the cheapest model. They are the ones routing intelligently, caching aggressively, and measuring per-feature unit economics. None of that is new in cloud cost; what is new is how badly it punishes you to skip it.