NAT Gateway Egress Is Eating Your AWS Bill
A client paid $14k a month in NAT Gateway data processing charges they did not know existed. Here is the math, the diagnosis, and the fix.
A client called me in to look at an AWS bill that had drifted up by $40k a month over a year. The CFO wanted to know why. The platform team thought it was "just growth". When I dug in, $14k of the increase was a single line item: NAT Gateway data processing.
This is the most common hidden AWS cost I run into; it turns up in almost every audit. Let's walk through the math, the diagnosis, and the fix.
The pricing nobody reads
A NAT Gateway in AWS costs:
- Roughly $0.045 per hour. About $33 a month. Fine.
- Roughly $0.045 per GB processed. This is the killer.
Note that data processing applies to traffic in both directions: the bytes going out to the internet and the response bytes coming back are both billed.
If your private subnet pushes 10TB a month through a NAT Gateway, the data processing charge alone is $450. If it pushes 300TB, it is $13,500. Plus regional cross-AZ transfer if your NAT Gateway is in a different AZ from your workload.
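To make the compounding concrete, here is a back-of-envelope calculator. The rates are the us-east-1 list prices quoted above; substitute your region's, and remember cross-AZ transfer comes on top if it applies.

```python
# Back-of-envelope monthly NAT Gateway cost. Rates are us-east-1 list
# prices ($0.045/hour, $0.045/GB processed); check your region's page.
HOURLY_USD = 0.045
PER_GB_USD = 0.045
HOURS_PER_MONTH = 730

def monthly_nat_cost(tb_processed: float, gateways: int = 1) -> float:
    hourly = gateways * HOURLY_USD * HOURS_PER_MONTH
    processing = tb_processed * 1000 * PER_GB_USD
    return hourly + processing

print(monthly_nat_cost(10))    # ~$483: the hourly charge is noise
print(monthly_nat_cost(300))   # ~$13,533: the processing charge IS the bill
```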
Most teams design the network in week one and never look at the cost again. By year three, the application that started at 100 GB a month is at 200 TB a month, and the bill has quietly compounded.
What was actually happening
For this client, three patterns combined:
- S3 traffic going via NAT. Their workload talked to S3 for object storage. S3's default endpoints sit on the public internet, so the traffic went through the NAT Gateway and every GB pulled from S3 was billed at NAT processing rates. They could have used a Gateway VPC Endpoint for S3, which is free.
- ECR pulls via NAT. Every container image pull, on every pod start, on every deploy, on every autoscaler scale-out, went through the NAT. Across hundreds of services and rolling deploys, this was many terabytes a month. ECR has Interface Endpoints that cost hourly but cut the per-GB charge to a fraction of the NAT rate.
- Cross-AZ NAT. Their default subnet layout put one NAT per AZ for HA, but their Kubernetes nodes routed all NAT traffic through a single AZ's NAT due to a misconfigured route table. So they paid cross-AZ data transfer plus NAT processing on top.
Total preventable: $14k a month. The fix took a sprint.
The diagnosis playbook
Before you change anything, find out where the traffic is going. Here is the order I run the checks in:
- Cost Explorer with a usage type filter. Filter for USE1-NatGateway-Bytes (or your region's equivalent). This tells you the size of the problem. A boto3 sketch for this step and the next follows the list.
- VPC Flow Logs. Enable them on the NAT subnet for a day. Aggregate by destination. The top destinations will be obvious. If S3, ECR, DynamoDB, or any AWS service is in the top five, you have a VPC endpoint problem.
- Cross-AZ check. Look at the Flow Logs source/destination AZs. If most NAT traffic is leaving an AZ different from where it originated, you have a routing problem.
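A minimal sketch of the first step with boto3, assuming default credentials; the usage type value is the us-east-1 name, so swap in your region's prefix:

```python
# Sketch: size the problem. Pulls last month's NAT Gateway data
# processing volume and cost from Cost Explorer.
import boto3
from datetime import date, timedelta

ce = boto3.client("ce", region_name="us-east-1")
first_of_this_month = date.today().replace(day=1)
first_of_last_month = (first_of_this_month - timedelta(days=1)).replace(day=1)

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": first_of_last_month.isoformat(),
                "End": first_of_this_month.isoformat()},
    Granularity="MONTHLY",
    Metrics=["UsageQuantity", "UnblendedCost"],
    Filter={"Dimensions": {"Key": "USAGE_TYPE",
                           "Values": ["USE1-NatGateway-Bytes"]}},  # region prefix varies
)

for period in resp["ResultsByTime"]:
    gb = float(period["Total"]["UsageQuantity"]["Amount"])
    usd = float(period["Total"]["UnblendedCost"]["Amount"])
    print(f"{period['TimePeriod']['Start']}: {gb:,.0f} GB, ${usd:,.2f}")
```

And a sketch of the second step, assuming a day of flow logs pulled down to a local file in the default v2 format (field positions differ if you customized the log format):

```python
# Sketch: top destinations by bytes from a local flow log export.
from collections import Counter

bytes_by_dst = Counter()
with open("flowlogs.txt") as f:        # hypothetical local export
    next(f)                            # skip the header line in S3-delivered logs
    for line in f:
        fields = line.split()
        dstaddr, nbytes = fields[4], fields[9]  # v2 default positions
        if nbytes != "-":              # NODATA/SKIPDATA records
            bytes_by_dst[dstaddr] += int(nbytes)

for dst, total in bytes_by_dst.most_common(10):
    print(f"{dst:>15}  {total / 1e9:9.2f} GB")
```

At real volume you would run the same aggregation as an Athena query over the S3 bucket rather than pulling files down, but the logic is identical.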
That is enough to scope the fix.
The fix
In order of cost-impact-per-effort:
Use Gateway Endpoints for S3 and DynamoDB
Gateway Endpoints are free. They route S3 and DynamoDB traffic privately, bypassing the NAT entirely. There is no good reason not to have them in every VPC that talks to S3 or DynamoDB. If your IaC does not provision them, fix the module.
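A minimal boto3 sketch, with placeholder VPC and route table IDs; in practice this belongs in the IaC module rather than a one-off script:

```python
# Sketch: provision free Gateway Endpoints for S3 and DynamoDB.
# IDs below are placeholders; attach every private route table so
# all subnets bypass the NAT for these services.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
VPC_ID = "vpc-0123456789abcdef0"                       # placeholder
ROUTE_TABLE_IDS = ["rtb-0123456789abcdef0",            # placeholders:
                   "rtb-0123456789abcdef1"]            # one per private subnet/AZ

for service in ("s3", "dynamodb"):
    endpoint = ec2.create_vpc_endpoint(
        VpcId=VPC_ID,
        ServiceName=f"com.amazonaws.us-east-1.{service}",
        VpcEndpointType="Gateway",
        RouteTableIds=ROUTE_TABLE_IDS,
    )
    print(service, endpoint["VpcEndpoint"]["VpcEndpointId"])
```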
Use Interface Endpoints for chatty AWS services
ECR (both ecr.api and ecr.dkr), Secrets Manager, SSM, KMS, STS. These are the high-volume ones for most workloads. Interface Endpoints cost about $7 per endpoint per AZ per month, plus roughly $0.01 per GB against $0.045 per GB through the NAT. The break-even is low: at a saving of about $0.035 per GB, a single-AZ endpoint pays for itself at a little over 200 GB a month.
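The same boto3 call provisions these, again with placeholder IDs. PrivateDnsEnabled matters: it makes the default service hostnames resolve to the endpoint ENIs, so clients need no configuration change.

```python
# Sketch: Interface Endpoints for the chatty services. The security
# group must allow inbound 443 from the VPC CIDR; subnet IDs are
# placeholders, one per AZ your workloads run in.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
VPC_ID = "vpc-0123456789abcdef0"
SUBNET_IDS = ["subnet-0123456789abcdef0", "subnet-0123456789abcdef1"]
SG_ID = "sg-0123456789abcdef0"

for service in ("ecr.api", "ecr.dkr", "secretsmanager", "ssm", "kms", "sts"):
    ec2.create_vpc_endpoint(
        VpcId=VPC_ID,
        ServiceName=f"com.amazonaws.us-east-1.{service}",
        VpcEndpointType="Interface",
        SubnetIds=SUBNET_IDS,
        SecurityGroupIds=[SG_ID],
        PrivateDnsEnabled=True,   # service hostnames resolve privately
    )
```

One gotcha: ECR serves image layers from S3, so the ecr.* endpoints only take the pull traffic off the NAT if the S3 Gateway Endpoint above is in place too.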
Fix cross-AZ routing
Each AZ's private subnets should route to that AZ's NAT. Sounds obvious. Half the IaC modules I see get it wrong because someone copy-pasted a route table.
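This is cheap to audit with a read-only script. A sketch that flags any private subnet whose default route exits through a NAT Gateway in another AZ:

```python
# Sketch: flag private subnets whose default route exits via a NAT
# Gateway in a different AZ. Read-only; pagination omitted for brevity.
import boto3

ec2 = boto3.client("ec2")

subnet_az = {s["SubnetId"]: s["AvailabilityZone"]
             for s in ec2.describe_subnets()["Subnets"]}
nat_az = {n["NatGatewayId"]: subnet_az[n["SubnetId"]]
          for n in ec2.describe_nat_gateways(
              Filters=[{"Name": "state", "Values": ["available"]}])["NatGateways"]}

for rt in ec2.describe_route_tables()["RouteTables"]:
    # The NAT default route for this table, if it has one.
    nat_id = next((r["NatGatewayId"] for r in rt["Routes"]
                   if r.get("DestinationCidrBlock") == "0.0.0.0/0"
                   and "NatGatewayId" in r), None)
    if nat_id is None:
        continue
    for assoc in rt["Associations"]:
        subnet = assoc.get("SubnetId")  # the main association has no subnet
        if subnet and subnet_az[subnet] != nat_az[nat_id]:
            print(f"{subnet} ({subnet_az[subnet]}) -> {nat_id} ({nat_az[nat_id]})")
```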
Consider NAT alternatives for very high egress
If your egress is dominated by one or two destinations, look at:
- A small fleet of EC2 instances running a NAT proxy. Higher operational cost, dramatically lower data charges at scale.
- For ECR specifically, a pull-through cache: a registry inside the VPC that serves repeated image pulls locally. A managed variant is sketched after this section.
These are last-resort options for the few teams pushing petabytes a month. Most teams do not need them.
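For the ECR case, if the upstream images are public (ECR Public, Quay, or Docker Hub with credentials), ECR's own managed pull-through cache gets you most of the way without running a registry yourself. A minimal sketch for an ECR Public upstream:

```python
# Sketch: ECR pull-through cache rule. After the first fetch, pulls of
# <account>.dkr.ecr.us-east-1.amazonaws.com/ecr-public/<image> are
# served from your registry; pair with the ecr.* and S3 endpoints above
# so the cached pulls never touch the NAT.
import boto3

ecr = boto3.client("ecr", region_name="us-east-1")
ecr.create_pull_through_cache_rule(
    ecrRepositoryPrefix="ecr-public",       # namespace inside your registry
    upstreamRegistryUrl="public.ecr.aws",   # authenticated upstreams also need credentialArn
)
```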
The cultural part
The reason this cost compounds is that nobody owns the network. The platform team owns "the platform". The product teams own "the product". Networking sits between them and gets ignored.
The fix is to put NAT processing into the FinOps dashboard alongside compute and storage. When a single line item crosses 5% of the bill, somebody notices, somebody investigates, and somebody fixes. Without that visibility, $14k a month vanishes into the noise of "AWS spend went up".
What I add to every audit
Three line items I now check on every cloud cost audit:
- NAT Gateway data processing.
- Cross-AZ data transfer.
- Inter-region data transfer.
These three are the silent assassins of cloud bills. Find them, fix them, and most teams free up enough budget for the project they thought they could not afford.
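If you want that check automated, one grouped Cost Explorer query surfaces all three. The usage-type substrings below are assumptions based on us-east-1 naming (USE1-NatGateway-Bytes, USE1-DataTransfer-Regional-Bytes, USE1-USW2-AWS-Out-Bytes and the like); verify them against your own bill.

```python
# Sketch: scan one month of spend for the three silent line items by
# grouping on usage type. Dates are placeholders.
import boto3

ce = boto3.client("ce", region_name="us-east-1")
resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-02-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
)

SUSPECTS = ("NatGateway-Bytes", "DataTransfer-Regional-Bytes", "AWS-Out-Bytes")
for group in resp["ResultsByTime"][0]["Groups"]:
    usage_type = group["Keys"][0]
    cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
    if any(s in usage_type for s in SUSPECTS) and cost > 0:
        print(f"{usage_type:45s} ${cost:,.2f}")
```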