AWS in 2026: Bedrock, Q, and the bet on inference-on-Graviton
Bedrock's catalog tripled, Nova landed, and AWS is betting that most inference does not need GPUs. A field report from production workloads.
AWS spent 2025 looking flat-footed on AI. By April 2026 — after the Nova family hit GA, the Bedrock catalog crossed 80 first- and third-party models, and the Trainium2-backed inference tier started cutting prices monthly — the picture is different. Not ahead of Google or Microsoft, but no longer a punchline.
Three observations from running production LLM workloads on AWS this quarter.
Bedrock is finally a platform, not a model proxy
For its first 18 months, Bedrock was a thin API in front of Anthropic, Cohere, and Meta. Useful for procurement, not differentiated. The Bedrock of 2026 is different on three axes:
- The catalog now includes Anthropic Claude 4, Mistral Large 3, Llama 4 (yes, finally), DeepSeek R2 in the GovCloud-adjacent region, and the full Amazon Nova line.
- Bedrock Agents shipped a real planner-executor split with checkpoint-and-resume, which means long-running agents survive Lambda timeouts without me writing a state machine.
- Custom Model Import covers fine-tuned Llama and Mistral derivatives without the SageMaker tax. This is the quiet feature that moved a client off self-hosted vLLM.
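The checkpoint-and-resume piece is the one worth understanding. Here is a minimal sketch of the pattern — not the Bedrock Agents API, just the mechanism that lets a long-running agent survive a timeout: persist progress after each step, and let the next invocation pick up where the last one stopped. The `steps`, `store`, and `budget` names are all illustrative stand-ins.

```python
# Illustrative checkpoint-and-resume loop: persist progress after each
# agent step so a re-invocation (e.g. after a Lambda timeout) resumes
# instead of restarting. Store and steps are hypothetical stand-ins.

def run_agent(steps, store, run_id, budget):
    """Execute steps, checkpointing after each; stop when budget is spent."""
    state = store.get(run_id, {"next": 0, "results": []})
    for i in range(state["next"], len(steps)):
        if budget <= 0:                      # simulate hitting a timeout
            break
        state["results"].append(steps[i]())  # do one unit of work
        state["next"] = i + 1                # checkpoint progress
        store[run_id] = state
        budget -= 1
    return state

# Two invocations with a budget of 2 steps each complete a 4-step plan.
steps = [lambda i=i: f"step-{i}" for i in range(4)]
store = {}
run_agent(steps, store, "run-1", budget=2)
final = run_agent(steps, store, "run-1", budget=2)
```

In production the store would be DynamoDB or S3 rather than a dict, but the shape is the same — and it is the state machine I no longer have to write myself.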
Amazon Nova deserves a paragraph. Nova Pro is not a Claude 4 competitor; it is a price-aggressive workhorse that wins on the tasks where you would have used GPT-4 Turbo a year ago. For classification, summarization, and structured extraction at scale, Nova Lite at $0.06/1M input tokens is what I default to now.
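At that price, the arithmetic for high-volume work is worth writing down. A back-of-envelope sketch — the input rate is the $0.06/1M quoted above; the output rate is an assumed placeholder, so check the current Bedrock price sheet before trusting the total:

```python
# Back-of-envelope cost for a high-volume extraction job on Nova Lite.
# Input rate is quoted above ($0.06/1M tokens); the output rate is an
# ASSUMED placeholder -- verify against the current Bedrock price sheet.

INPUT_PER_M = 0.06    # $/1M input tokens
OUTPUT_PER_M = 0.24   # $/1M output tokens -- assumption, not quoted

def job_cost(docs, in_tokens_per_doc, out_tokens_per_doc):
    """Total dollars to run `docs` documents through the model."""
    total_in = docs * in_tokens_per_doc
    total_out = docs * out_tokens_per_doc
    return total_in / 1e6 * INPUT_PER_M + total_out / 1e6 * OUTPUT_PER_M

# 1M documents, ~500 input tokens and ~20 output tokens each:
cost = job_cost(1_000_000, 500, 20)
print(f"${cost:.2f}")
```

A million documents for roughly the price of a team lunch is why Nova Lite is the default, not the exception.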
Amazon Q is two products, and only one is good
Amazon Q Developer is good. It reads your CDK and Terraform, knows your IAM policies, suggests least-privilege fixes that actually work, and the cost-anomaly explanations are better than what I get from third-party tools. I keep it on.
Amazon Q Business is still a confused product. It wants to be a knowledge agent, a workflow builder, and a Slack bot. It does none of them as well as a focused tool. If a client asks me about it, I redirect to Bedrock Agents plus a thin frontend.
Comparing the AWS inference paths
| Path | Latency | $/1M tokens (Llama-class) | Custom-model support | Ops burden |
| --- | --- | --- | --- | --- |
| Bedrock on-demand | Low | $0.20–$0.80 | Custom Model Import | None |
| Bedrock provisioned throughput | Lowest | Commit pricing, ~30% off | Yes, hourly commit | Low |
| SageMaker real-time endpoint | Low | $0.40–$1.50 (compute) | Anything | Medium |
| EC2 + Inferentia2 / Trainium2 | Lowest | $0.10–$0.40 effective | Anything, including custom kernels | High |
The interesting trend: provisioned-throughput pricing on Bedrock has dropped to the point where running your own inference plane on EC2 + Inferentia2 only makes sense at very large scale or for models AWS will not host. A year ago, that crossover point was much lower.
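The crossover is simple to model. A toy sketch, assuming a flat ~30% discount on committed spend (from the table above) and a minimum monthly commit — both placeholders for whatever your actual quote says:

```python
# When does Bedrock provisioned throughput beat on-demand? Toy model
# assuming a flat ~30% discount on committed spend plus a minimum
# monthly commit -- both ASSUMED placeholders, not published pricing.

DISCOUNT = 0.30          # ~30% off on-demand rates (from the table above)
MIN_COMMIT = 7_000.0     # $/mo minimum commit -- assumed placeholder

def cheaper_path(on_demand_monthly):
    """Return the cheaper path and its monthly cost for a given spend."""
    provisioned = max(on_demand_monthly * (1 - DISCOUNT), MIN_COMMIT)
    if provisioned < on_demand_monthly:
        return ("provisioned", provisioned)
    return ("on-demand", on_demand_monthly)

# Below the commit floor, on-demand wins; well past it, provisioned does.
print(cheaper_path(5_000))   # on-demand is cheaper
print(cheaper_path(12_000))  # provisioned is cheaper
```

On this toy model the lines cross at the commit floor itself; in practice you want headroom above it for traffic variability before taking on a commit.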
The silicon bet
AWS is making a bet most coverage misses: that the median enterprise inference workload does not need an H100. Trainium2 for training, Inferentia2 for hosted inference, and — the sleeper — Graviton4 for CPU-class inference of small models and embedding pipelines.
The Graviton4 inference story is specifically interesting for the boring half of any AI stack:
- BGE and other embedding models run cheaply on Graviton4 with negligible accuracy loss versus GPU.
- Reranking, classification under 1B parameters, and most tabular ML tasks fit on CPU at a fraction of GPU cost.
- The 30%+ price-performance gap on general compute means everything around your inference plane (API gateways, retrieval, evaluation) is cheaper too.
I now spec a default of: Graviton4 for embeddings and orchestration, Bedrock for LLM calls, Inferentia2 only when a specific custom model justifies it. That stack is roughly 40% cheaper than the all-GPU equivalent I was building in early 2025.
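That default can be encoded as a trivial placement rule. A sketch of the routing logic I mean — the thresholds are my own rules of thumb, not AWS guidance:

```python
# Toy placement rule for the stack described above: sub-1B embedding,
# reranking, and classification work goes to Graviton4 (CPU); LLM calls
# go to Bedrock; only a justified custom model goes to Inferentia2.
# Thresholds are rules of thumb, not AWS guidance.

def place(task, params_b, custom_model=False):
    """Pick a serving target for a model of `params_b` billion params."""
    if task in ("embedding", "rerank", "classification") and params_b < 1.0:
        return "graviton4"       # CPU handles sub-1B models cheaply
    if custom_model:
        return "inferentia2"     # self-hosted only when the model demands it
    return "bedrock"             # managed default for LLM calls

print(place("embedding", 0.1))                # graviton4
print(place("chat", 70))                      # bedrock
print(place("chat", 13, custom_model=True))   # inferentia2
```

The point is that the decision fits in ten lines; the savings come from actually enforcing it instead of defaulting everything to GPU.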
Where AWS still loses
- The console. Bedrock's playground, Agents builder, and evaluation tools live in three different UI paradigms. Pick one, please.
- Multimodal. Nova handles vision; audio is behind. If voice is core to your product, Gemini 2 Flash or Azure's OpenAI-backed voice endpoints are still ahead.
- Cross-region failover for Bedrock. Documented, but in practice fragile. Build it yourself if uptime matters.
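"Build it yourself" can be as small as a client-side wrapper that walks an ordered region list. A generic sketch — with boto3 you would wrap per-region `bedrock-runtime` invocations; here the calls are plain callables so the pattern stands alone, and the region names are illustrative:

```python
# Minimal client-side cross-region failover: try each regional callable
# in order and fall through on failure. With boto3, `calls` would hold
# per-region bedrock-runtime invocations; here they are generic callables.

def with_failover(calls, retryable=(Exception,)):
    """Return (region, result) of the first success; raise if all fail."""
    last_err = None
    for region, call in calls:
        try:
            return region, call()
        except retryable as err:
            last_err = err           # remember the error, try next region
    raise RuntimeError("all regions failed") from last_err

# Primary region errors out; the wrapper falls through to the secondary.
def flaky():
    raise TimeoutError("us-east-1 throttled")

region, result = with_failover([
    ("us-east-1", flaky),
    ("us-west-2", lambda: "ok"),
])
```

In practice you would narrow `retryable` to throttling and availability errors so that genuine request bugs surface instead of burning through every region.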
Takeaways
- Default to Bedrock on-demand. Move to provisioned throughput when you cross roughly $10K/mo on a single model.
- Use Nova Lite for high-volume cheap work. Use Claude 4 on Bedrock for code and complex reasoning. Use everything else only with a clear reason.
- Put embeddings, reranking, and orchestration on Graviton4. Stop renting H100s for tasks that fit in 100ms on a CPU.
- Adopt Q Developer if you live in AWS. Skip Q Business until it picks an identity.
- Read the Bedrock release notes every two weeks; pricing and catalog moves are now monthly.
AWS is no longer an embarrassment on AI. It is, for many production workloads, the pragmatic choice — especially if your data and your team already live in its console.