AWS in 2026: Bedrock, Q, and the bet on inference-on-Graviton
Bedrock's catalog tripled, Nova landed, and AWS is betting that most inference does not need GPUs. A field report from production workloads.
AWS spent 2025 looking flat-footed on AI. By April 2026 — after the Nova family hit GA, the Bedrock catalog crossed 80 first- and third-party models, and the Trainium2-backed inference tier started cutting prices monthly — the picture is different. Not ahead of Google or Microsoft, but no longer a punchline.
Three observations from running production LLM workloads on AWS this quarter.
Bedrock is finally a platform, not a model proxy
For its first 18 months, Bedrock was a thin API in front of Anthropic, Cohere, and Meta. Useful for procurement, not differentiated. The Bedrock of 2026 is different on three axes:
- The catalog now includes Anthropic Claude 4, Mistral Large 3, Llama 4 (yes, finally), DeepSeek R2 in the GovCloud-adjacent region, and the full Amazon Nova line.
- Bedrock Agents shipped a real planner-executor split with checkpoint-and-resume, which means long-running agents survive Lambda timeouts without me writing a state machine.
- Custom Model Import covers fine-tuned Llama and Mistral derivatives without the SageMaker tax. This is the quiet feature that moved a client off self-hosted vLLM.
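The checkpoint-and-resume piece is the one worth understanding. Here is a minimal sketch of the pattern — not the Bedrock Agents API, just the mechanism that lets a long-running agent survive a timeout: persist progress after each step, and let the next invocation pick up where the last one stopped. The `steps`, `store`, and `budget` names are all illustrative stand-ins.

```python
# Illustrative checkpoint-and-resume loop: persist progress after each
# agent step so a re-invocation (e.g. after a Lambda timeout) resumes
# instead of restarting. Store and steps are hypothetical stand-ins.

def run_agent(steps, store, run_id, budget):
    """Execute steps, checkpointing after each; stop when budget is spent."""
    state = store.get(run_id, {"next": 0, "results": []})
    for i in range(state["next"], len(steps)):
        if budget <= 0:                      # simulate hitting a timeout
            break
        state["results"].append(steps[i]())  # do one unit of work
        state["next"] = i + 1                # checkpoint progress
        store[run_id] = state
        budget -= 1
    return state

# Two invocations with a budget of 2 steps each complete a 4-step plan.
steps = [lambda i=i: f"step-{i}" for i in range(4)]
store = {}
run_agent(steps, store, "run-1", budget=2)
final = run_agent(steps, store, "run-1", budget=2)
```

In production the store would be DynamoDB or S3 rather than a dict, but the shape is the same — and it is the state machine I no longer have to write myself.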
Amazon Nova deserves a paragraph. Nova Pro is not a Claude 4 competitor; it is a price-aggressive workhorse that wins on the tasks where you would have used GPT-4 Turbo a year ago. For classification, summarization, and structured extraction at scale, Nova Lite at $0.06/1M input tokens is what I default to now.
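At that price, the arithmetic for high-volume work is worth writing down. A back-of-envelope sketch — the input rate is the $0.06/1M quoted above; the output rate is an assumed placeholder, so check the current Bedrock price sheet before trusting the total:

```python
# Back-of-envelope cost for a high-volume extraction job on Nova Lite.
# Input rate is quoted above ($0.06/1M tokens); the output rate is an
# ASSUMED placeholder -- verify against the current Bedrock price sheet.

INPUT_PER_M = 0.06    # $/1M input tokens
OUTPUT_PER_M = 0.24   # $/1M output tokens -- assumption, not quoted

def job_cost(docs, in_tokens_per_doc, out_tokens_per_doc):
    """Total dollars to run `docs` documents through the model."""
    total_in = docs * in_tokens_per_doc
    total_out = docs * out_tokens_per_doc
    return total_in / 1e6 * INPUT_PER_M + total_out / 1e6 * OUTPUT_PER_M

# 1M documents, ~500 input tokens and ~20 output tokens each:
cost = job_cost(1_000_000, 500, 20)
print(f"${cost:.2f}")
```

A million documents for roughly the price of a team lunch is why Nova Lite is the default, not the exception.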
Amazon Q is two products, and only one is good
Amazon Q Developer is good. It reads your CDK and Terraform, knows your IAM policies, suggests least-privilege fixes that actually work, and the cost-anomaly explanations are better than what I get from third-party tools. I keep it on.
Amazon Q Business is still a confused product. It wants to be a knowledge agent, a workflow builder, and a Slack bot. It does none of them as well as a focused tool. If a client asks me about it, I redirect to Bedrock Agents plus a thin frontend.
Comparing the AWS inference paths
| Path | Latency | $/1M tokens (Llama-class) | Custom-model support | Ops burden |
| --- | --- | --- | --- | --- |
| Bedrock on-demand | Low | $0.20–$0.80 | Custom Model Import | None |
| Bedrock provisioned throughput | Lowest | Commit pricing, ~30% off | Yes, hourly commit | Low |
| SageMaker real-time endpoint | Low | $0.40–$1.50 (compute) | Anything | Medium |
| EC2 + Inferentia2 / Trainium2 | Lowest | $0.10–$0.40 effective | Anything, including custom kernels | High |
The interesting trend: provisioned-throughput pricing on Bedrock has dropped to the point where running your own inference plane on EC2 + Inferentia2 only makes sense at very large scale or for models AWS will not host. A year ago, that crossover point was much lower.
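The crossover is simple to model. A toy sketch, assuming a flat ~30% discount on committed spend (from the table above) and a minimum monthly commit — both placeholders for whatever your actual quote says:

```python
# When does Bedrock provisioned throughput beat on-demand? Toy model
# assuming a flat ~30% discount on committed spend plus a minimum
# monthly commit -- both ASSUMED placeholders, not published pricing.

DISCOUNT = 0.30          # ~30% off on-demand rates (from the table above)
MIN_COMMIT = 7_000.0     # $/mo minimum commit -- assumed placeholder

def cheaper_path(on_demand_monthly):
    """Return the cheaper path and its monthly cost for a given spend."""
    provisioned = max(on_demand_monthly * (1 - DISCOUNT), MIN_COMMIT)
    if provisioned < on_demand_monthly:
        return ("provisioned", provisioned)
    return ("on-demand", on_demand_monthly)

# Below the commit floor, on-demand wins; well past it, provisioned does.
print(cheaper_path(5_000))   # on-demand is cheaper
print(cheaper_path(12_000))  # provisioned is cheaper
```

On this toy model the lines cross at the commit floor itself; in practice you want headroom above it for traffic variability before taking on a commit.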
The silicon bet
AWS is making a bet most coverage misses: that the median enterprise inference workload does not need an H100. Trainium2 for training, Inferentia2 for hosted inference, and — the sleeper — Graviton4 for CPU-class inference of small models and embedding pipelines.
The Graviton4 inference story is specifically interesting for the boring half of any AI stack:
- BGE and other embedding models run cheaply on Graviton4 with negligible accuracy loss versus GPU.
- Reranking, classification under 1B parameters, and most tabular ML tasks fit on CPU at a fraction of GPU cost.
- The 30%+ price-performance gap on general compute means everything around your inference plane (API gateways, retrieval, evaluation) is cheaper too.
I now spec a default of: Graviton4 for embeddings and orchestration, Bedrock for LLM calls, Inferentia2 only when a specific custom model justifies it. That stack is roughly 40% cheaper than the all-GPU equivalent I was building in early 2025.
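That default can be encoded as a trivial placement rule. A sketch of the routing logic I mean — the thresholds are my own rules of thumb, not AWS guidance:

```python
# Toy placement rule for the stack described above: sub-1B embedding,
# reranking, and classification work goes to Graviton4 (CPU); LLM calls
# go to Bedrock; only a justified custom model goes to Inferentia2.
# Thresholds are rules of thumb, not AWS guidance.

def place(task, params_b, custom_model=False):
    """Pick a serving target for a model of `params_b` billion params."""
    if task in ("embedding", "rerank", "classification") and params_b < 1.0:
        return "graviton4"       # CPU handles sub-1B models cheaply
    if custom_model:
        return "inferentia2"     # self-hosted only when the model demands it
    return "bedrock"             # managed default for LLM calls

print(place("embedding", 0.1))                # graviton4
print(place("chat", 70))                      # bedrock
print(place("chat", 13, custom_model=True))   # inferentia2
```

The point is that the decision fits in ten lines; the savings come from actually enforcing it instead of defaulting everything to GPU.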
Where AWS still loses
- The console. Bedrock's playground, Agents builder, and evaluation tools live in three different UI paradigms. Pick one, please.
- Multimodal. Nova handles vision; audio is behind. If voice is core to your product, Gemini 2 Flash or Azure's OpenAI-backed voice endpoints are still ahead.
- Cross-region failover for Bedrock. Documented, but in practice fragile. Build it yourself if uptime matters.
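"Build it yourself" can be as small as a client-side wrapper that walks an ordered region list. A generic sketch — with boto3 you would wrap per-region `bedrock-runtime` invocations; here the calls are plain callables so the pattern stands alone, and the region names are illustrative:

```python
# Minimal client-side cross-region failover: try each regional callable
# in order and fall through on failure. With boto3, `calls` would hold
# per-region bedrock-runtime invocations; here they are generic callables.

def with_failover(calls, retryable=(Exception,)):
    """Return (region, result) of the first success; raise if all fail."""
    last_err = None
    for region, call in calls:
        try:
            return region, call()
        except retryable as err:
            last_err = err           # remember the error, try next region
    raise RuntimeError("all regions failed") from last_err

# Primary region errors out; the wrapper falls through to the secondary.
def flaky():
    raise TimeoutError("us-east-1 throttled")

region, result = with_failover([
    ("us-east-1", flaky),
    ("us-west-2", lambda: "ok"),
])
```

In practice you would narrow `retryable` to throttling and availability errors so that genuine request bugs surface instead of burning through every region.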
Takeaways
- Default to Bedrock on-demand. Move to provisioned throughput when you cross roughly $10K/mo on a single model.
- Use Nova Lite for high-volume cheap work. Use Claude 4 on Bedrock for code and complex reasoning. Use everything else only with a clear reason.
- Put embeddings, reranking, and orchestration on Graviton4. Stop renting H100s for tasks that fit in 100ms on a CPU.
- Adopt Q Developer if you live in AWS. Skip Q Business until it picks an identity.
- Read the Bedrock release notes every two weeks; pricing and catalog moves are now monthly.
AWS is no longer an embarrassment on AI. It is, for many production workloads, the pragmatic choice — especially if your data and your team already live in its console.