Bringing an AI infra bill back under control without slowing training
Cut monthly GPU spend by 44% while improving job throughput and shortening time-to-first-token on inference.
- Monthly GPU spend: -44%
- Training job throughput: +27%
- Inference p50 TTFT: -31%
- Spot interruption recovery: <90s, checkpointed
On-demand H100 nodes ran 24/7 regardless of queue depth, there was no checkpointing strategy for spot capacity, and the inference path was overprovisioned because nobody trusted the autoscaler.
- 01
Profiled actual GPU utilisation per workload (training, fine-tuning, and inference) and found that three of the largest line items were running below 30% utilisation.
- 02
Introduced Karpenter with separate node pools for training (spot, large, interruption-tolerant) and inference (on-demand, smaller, latency-sensitive). Wrote disruption budgets per pool.
- 03
Added checkpoint-on-eviction for training jobs with a tested recovery flow, validated by deliberately killing nodes during a real run.
- 04
Switched the inference autoscaler signal from CPU to in-flight request count plus GPU memory headroom. Kept scale-down conservative to protect TTFT.
- 05
Negotiated a GPU committed-use discount sized against the post-tuning baseline, not the pre-tuning bloat. Locked in savings the same week as the optimisation work.
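The checkpoint-on-eviction pattern from step 03 can be sketched roughly as follows. This is a minimal illustration, not the engagement's actual code. It assumes the training pod receives SIGTERM when its spot node is drained (the default Kubernetes termination signal), and the names `EvictionWatcher`, `save_checkpoint`, `load_checkpoint`, and `train` are hypothetical. The toy loop stands in for a real training step; a real job would serialise model and optimiser state rather than a small JSON dict.

```python
import json
import os
import signal
import tempfile


class EvictionWatcher:
    """Remembers that a termination signal arrived so the loop can stop cleanly."""

    def __init__(self):
        self.evicted = False

    def handle(self, signum, frame):
        # Only set a flag here; the checkpoint is written at a step boundary.
        self.evicted = True

    def install(self):
        # Assumption: eviction reaches the process as SIGTERM.
        signal.signal(signal.SIGTERM, self.handle)


def save_checkpoint(path, state):
    # Write to a temp file in the same directory, then rename atomically,
    # so an interrupted write never replaces the previous checkpoint.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    # Start from step 0 when no checkpoint exists yet.
    if not os.path.exists(path):
        return {"step": 0, "loss": None}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, watcher, on_step=None):
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        state["step"] += 1
        state["loss"] = 1.0 / state["step"]  # stand-in for a real training step
        if on_step is not None:
            on_step(state["step"])
        if watcher.evicted:
            save_checkpoint(path, state)
            return "evicted"
    save_checkpoint(path, state)
    return "finished"
```

Calling `watcher.install()` before `train(...)` registers the handler, and rerunning the same job after an eviction resumes from the saved step. The checkpoint write has to fit inside whatever termination grace period the pod is given, which is why deliberately killing nodes during a real run is the meaningful test.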
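The scaling rule described in step 04 might look something like the sketch below. The thresholds, the `ScalerConfig` fields, and the `desired_replicas` function are assumptions made for illustration; the actual targets and cooldowns used in the engagement are not stated here. The idea it shows is that scale-up follows in-flight requests or low GPU memory headroom immediately, while scale-down waits out a cooldown and removes one replica at a time.

```python
import math
from dataclasses import dataclass


@dataclass
class ScalerConfig:
    # All values are hypothetical defaults, not figures from the engagement.
    target_in_flight_per_replica: int = 8
    min_memory_headroom: float = 0.20   # fraction of GPU memory that should stay free
    scale_down_cooldown_s: float = 600.0
    min_replicas: int = 1
    max_replicas: int = 16


def desired_replicas(current, in_flight, memory_headroom, seconds_below_target, cfg):
    """Return the replica count to run next.

    current              -- replicas running now
    in_flight            -- requests currently being served across all replicas
    memory_headroom      -- lowest free-GPU-memory fraction across replicas (0.0 to 1.0)
    seconds_below_target -- how long load has been low enough to justify fewer replicas
    """
    # Primary signal: enough replicas to keep in-flight requests per replica at target.
    wanted = math.ceil(in_flight / cfg.target_in_flight_per_replica)

    # Secondary signal: if GPU memory is tight, add a replica even if request count is low.
    if memory_headroom < cfg.min_memory_headroom:
        wanted = max(wanted, current + 1)

    wanted = max(cfg.min_replicas, min(cfg.max_replicas, wanted))

    # Scale up (or hold) immediately.
    if wanted >= current:
        return wanted

    # Scale down conservatively: only after the cooldown, one replica per decision.
    if seconds_below_target < cfg.scale_down_cooldown_s:
        return current
    return current - 1
```

With the defaults above, `desired_replicas(4, 50, 0.5, 0.0, ScalerConfig())` asks for 7 replicas right away, while a drop to 10 in-flight requests holds at 4 until the cooldown has elapsed and then steps down to 3. The asymmetry is what protects TTFT: a brief lull never strips capacity that the next burst would need.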
The bill dropped by close to half within the first full billing cycle after tuning. Research velocity went up, not down — fewer queue waits, faster checkpoint recovery, and an inference path with predictable latency. The platform team got time back to work on multi-tenant scheduling.
Other engagements.
Rebuilding the platform under a payments company without slowing the roadmap
Cut deploy time from 38 minutes to under 9, reduced cluster spend by 31%, and got the team out of a quarterly upgrade panic.
Standing up a platform team where there wasn't one
Delivered a working internal developer platform, paved-path service template, and hired the two engineers who own it now.
HIPAA-aligned cloud foundation for a clinical data startup
Cleared HIPAA technical safeguards review with their first enterprise customer's security team — on the first pass.