ai · aws · bedrock · graviton · cloud

AWS in 2026: Bedrock, Q, and the bet on inference-on-Graviton

Bedrock's catalog tripled, Nova landed, and AWS is betting that most inference does not need GPUs. A field report from production workloads.

9 April 2026 · 5 min read

AWS spent 2025 looking flat-footed on AI. By April 2026 the picture is different: the Nova family hit GA, the Bedrock catalog crossed 80 first- and third-party models, and the Trainium2-backed inference tier started cutting prices monthly. AWS is not ahead of Google or Microsoft, but it is no longer a punchline.

Three observations from running production LLM workloads on AWS this quarter.

Bedrock is finally a platform, not a model proxy

For its first 18 months, Bedrock was a thin API in front of Anthropic, Cohere, and Meta. Useful for procurement, not differentiated. The Bedrock of 2026 is different on three axes:

  • The catalog now includes Anthropic Claude 4, Mistral Large 3, Llama 4 (yes, finally), DeepSeek R2 in the GovCloud-adjacent region, and the full Amazon Nova line.
  • Bedrock Agents shipped a real planner-executor split with checkpoint-and-resume, which means long-running agents survive Lambda timeouts without me writing a state machine.
  • Custom Model Import covers fine-tuned Llama and Mistral derivatives without the SageMaker tax. This is the quiet feature that moved a client off self-hosted vLLM.
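From the caller's side, the checkpoint-and-resume behavior comes down to reinvoking the agent with the same `sessionId`. A minimal sketch using boto3's `bedrock-agent-runtime` client; the agent and alias IDs are placeholders, the planner-executor internals are opaque to the caller, and this assumes your credentials and region are configured in the environment.

```python
def collect_agent_text(completion_events) -> str:
    """Concatenate the text chunks from an InvokeAgent streaming response."""
    parts = []
    for event in completion_events:
        chunk = event.get("chunk")
        if chunk and "bytes" in chunk:
            parts.append(chunk["bytes"].decode("utf-8"))
    return "".join(parts)


def run_agent_step(agent_id: str, alias_id: str, session_id: str, text: str) -> str:
    """One turn against a Bedrock agent; reuse session_id to resume prior state."""
    import boto3  # lazy import so the pure helper above is testable offline

    client = boto3.client("bedrock-agent-runtime")
    resp = client.invoke_agent(
        agentId=agent_id,
        agentAliasId=alias_id,
        sessionId=session_id,  # same sessionId across calls = resumed context
        inputText=text,
    )
    return collect_agent_text(resp["completion"])
```

The point is that the state machine I used to write by hand now lives on the service side; my code only has to persist the session ID.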

Amazon Nova deserves a paragraph. Nova Pro is not a Claude 4 competitor; it is a price-aggressive workhorse that wins on the tasks where you would have used GPT-4 Turbo a year ago. For classification, summarization, and structured extraction at scale, Nova Lite at $0.06/1M input tokens is what I default to now.
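A typical structured-extraction call looks like this, using the Converse API in boto3. The model ID is the Nova Lite identifier current at the time of writing; the prompt and JSON schema are illustrative, not a recommended template.

```python
import json


NOVA_LITE = "amazon.nova-lite-v1:0"  # model ID current at time of writing


def build_extraction_request(review: str) -> dict:
    """Build a Converse API request asking Nova Lite for JSON-only output."""
    prompt = (
        "Extract the company name and sentiment (positive/negative/neutral) "
        "from the review below. Reply with JSON only.\n\n" + review
    )
    return {
        "modelId": NOVA_LITE,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": 256, "temperature": 0.0},
    }


def extract(review: str, region: str = "us-east-1") -> dict:
    """Call Nova Lite and parse its JSON reply (assumes a well-behaved response)."""
    import boto3  # lazy import keeps the request builder testable offline

    client = boto3.client("bedrock-runtime", region_name=region)
    resp = client.converse(**build_extraction_request(review))
    return json.loads(resp["output"]["message"]["content"][0]["text"])
```

Temperature zero and "JSON only" get you most of the way for extraction work; at Nova Lite's price, retrying the occasional malformed reply is cheaper than a bigger model.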

Amazon Q is two products, and only one is good

Amazon Q Developer is good. It reads your CDK and Terraform, knows your IAM policies, suggests least-privilege fixes that actually work, and the cost-anomaly explanations are better than what I get from third-party tools. I keep it on.

Amazon Q Business is still a confused product. It wants to be a knowledge agent, a workflow builder, and a Slack bot. It does none of them as well as a focused tool. If a client asks me about it, I redirect to Bedrock Agents plus a thin frontend.

Comparing the AWS inference paths

| Path | Latency | $/1M tokens (Llama-class) | Custom-model support | Ops burden |
| --- | --- | --- | --- | --- |
| Bedrock on-demand | Low | $0.20–$0.80 | Custom Model Import | None |
| Bedrock provisioned throughput | Lowest | Commit pricing, ~30% off | Yes, hourly commit | Low |
| SageMaker real-time endpoint | Low | $0.40–$1.50 (compute) | Anything | Medium |
| EC2 + Inferentia2 / Trainium2 | Lowest | $0.10–$0.40 effective | Anything, including custom kernels | High |

The interesting trend: Bedrock's provisioned throughput pricing has come down to the point where running your own inference plane on EC2 + Inferentia2 only makes sense at very large scale or for models AWS will not host. A year ago, that threshold was much lower.
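The breakeven is easy to compute. A sketch with illustrative numbers, not quoted AWS prices: an hourly provisioned commit pays off once your monthly token volume would cost more on demand.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def monthly_on_demand_cost(tokens: float, usd_per_1m: float) -> float:
    """On-demand cost for a month's token volume."""
    return tokens / 1_000_000 * usd_per_1m


def breakeven_tokens(provisioned_usd_per_hour: float,
                     on_demand_usd_per_1m: float) -> float:
    """Monthly token volume at which an hourly commit beats on-demand."""
    monthly_commit = provisioned_usd_per_hour * HOURS_PER_MONTH
    return monthly_commit / on_demand_usd_per_1m * 1_000_000


# Illustrative: a $20/hr commit vs $0.50 per 1M on-demand tokens
# breaks even at 29.2B tokens/month.
volume = breakeven_tokens(20.0, 0.50)
```

Run your own numbers before committing; the hourly rates move monthly, which is exactly why the threshold keeps climbing.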

The silicon bet

AWS is making a bet most coverage misses: that the median enterprise inference workload does not need an H100. Trainium2 for training, Inferentia2 for hosted inference, and — the sleeper — Graviton4 for CPU-class inference of small models and embedding pipelines.

The Graviton4 inference story is specifically interesting for the boring half of any AI stack:

  • BGE and other embedding models run cheaply on Graviton4 with negligible accuracy loss versus GPU.
  • Reranking, classification under 1B parameters, and most tabular ML tasks fit on CPU at a fraction of GPU cost.
  • The 30%+ price-performance gap on general compute means everything around your inference plane (API gateways, retrieval, evaluation) is cheaper too.
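The economics are simple enough to sketch. All numbers below are illustrative assumptions, not benchmarks: a Graviton c8g-class instance price and throughput versus a GPU instance, for a small embedding model.

```python
def cost_per_million_embeddings(instance_usd_per_hour: float,
                                embeddings_per_second: float) -> float:
    """USD to embed one million texts at a given sustained throughput."""
    seconds = 1_000_000 / embeddings_per_second
    return seconds / 3600 * instance_usd_per_hour


# Illustrative assumptions: ~$0.60/hr CPU at ~400 embeddings/sec
# vs ~$4.00/hr GPU at ~2,000 embeddings/sec.
cpu_cost = cost_per_million_embeddings(0.60, 400)
gpu_cost = cost_per_million_embeddings(4.00, 2_000)
```

Under these assumptions the CPU comes out cheaper per million embeddings despite being 5x slower, and the gap widens once you count GPU idle time between batches.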

My default spec is now Graviton4 for embeddings and orchestration, Bedrock for LLM calls, and Inferentia2 only when a specific custom model justifies it. That stack is roughly 40% cheaper than the all-GPU equivalent I was building in early 2025.

Where AWS still loses

  • The console. Bedrock's playground, Agents builder, and evaluation tools live in three different UI paradigms. Pick one, please.
  • Multimodal. Nova handles vision; audio is behind. If voice is core to your product, Gemini 2 Flash or Azure's OpenAI-backed voice endpoints are still ahead.
  • Cross-region failover for Bedrock. Documented, but in practice fragile. Build it yourself if uptime matters.
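"Build it yourself" is less work than it sounds: client-side failover is a wrapper that walks an ordered list of region-bound calls. A minimal sketch; the region names are examples, and in real code you would catch botocore's `ClientError`/throttling exceptions specifically rather than bare `Exception`.

```python
from typing import Callable, Sequence


def with_failover(calls: Sequence[Callable[[], str]]) -> str:
    """Try each region-bound call in order; return the first success."""
    last_err = None
    for call in calls:
        try:
            return call()
        except Exception as err:  # narrow to botocore ClientError in real code
            last_err = err
    raise RuntimeError("all regions failed") from last_err


# Usage sketch (assumes boto3 and two Bedrock-enabled regions):
# import boto3
# clients = [boto3.client("bedrock-runtime", region_name=r)
#            for r in ("us-east-1", "us-west-2")]
# text = with_failover([
#     lambda c=c: c.converse(modelId=model_id, messages=msgs)
#                  ["output"]["message"]["content"][0]["text"]
#     for c in clients
# ])
```

Keep the model available in every region on the list; failover to a region that hosts a different catalog fails in a more confusing way than a throttle does.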

Takeaways

  • Default to Bedrock on-demand. Move to provisioned throughput when you cross roughly $10K/mo on a single model.
  • Use Nova Lite for high-volume cheap work. Use Claude 4 on Bedrock for code and complex reasoning. Use everything else only with a clear reason.
  • Put embeddings, reranking, and orchestration on Graviton4. Stop renting H100s for tasks that fit in 100ms on a CPU.
  • Adopt Q Developer if you live in AWS. Skip Q Business until it picks an identity.
  • Read the Bedrock release notes every two weeks; pricing and catalog moves are now monthly.

AWS is no longer an embarrassment on AI. It is, for many production workloads, the pragmatic choice — especially if your data and your team already live in its console.