AI/ML Infrastructure · 12 weeks · Series-B AI/ML platform (model training & inference)

Bringing an AI infra bill back under control without slowing training

Cut monthly GPU spend by 44% while improving job throughput and shortening time-to-first-token on inference.

GPU scheduling · FinOps · Karpenter · Spot · Inference
By the numbers

- Monthly GPU spend: -44%
- Training job throughput: +27%
- Inference p50 time-to-first-token (TTFT): -31%
- Spot interruption recovery: <90 s (checkpointed)
Problem

On-demand H100 nodes were running 24/7 regardless of queue depth, there was no checkpointing strategy for spot, and the inference path was overprovisioned because nobody trusted the autoscaler.

Approach
01. Profiled actual GPU utilisation per workload class (training, fine-tuning, inference) and found three of the largest line items were below 30% utilised.

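The write-up doesn't say which tooling produced those utilisation numbers, so here is only a minimal sketch of that kind of profiling pass, assuming GPU metrics are exported by dcgm-exporter and scraped into Prometheus. The endpoint, the 7-day window, and the `workload_class` label are illustrative assumptions, not details from the engagement.

```python
import requests  # assumes a reachable Prometheus instance that scrapes dcgm-exporter

PROM_URL = "http://prometheus.monitoring:9090"  # assumed endpoint, not the real one

# Average GPU utilisation over the past week, grouped by a workload label.
# DCGM_FI_DEV_GPU_UTIL is the dcgm-exporter utilisation gauge; "workload_class"
# is an illustrative label, not the platform's actual schema.
QUERY = 'avg by (workload_class) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[7d]))'


def utilisation_by_workload() -> dict[str, float]:
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=30)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return {r["metric"].get("workload_class", "unknown"): float(r["value"][1]) for r in result}


if __name__ == "__main__":
    # Print workloads from least to most utilised and flag the under-30% ones.
    for workload, util in sorted(utilisation_by_workload().items(), key=lambda kv: kv[1]):
        flag = "  <- below 30%" if util < 30 else ""
        print(f"{workload:<16} {util:5.1f}% avg GPU util{flag}")
```
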
02. Introduced Karpenter with separate node pools for training (spot, large instance types, interruption-tolerant) and inference (on-demand, smaller, latency-sensitive), and wrote a disruption budget for each pool.

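A sketch of those two pools, expressed as Python dicts that mirror `karpenter.sh/v1` NodePool manifests rather than the YAML that would actually be applied. The field names are written from memory of the v1 schema, the AWS instance families and budget values are illustrative (the write-up only says large vs smaller GPU nodes), and the `nodeClassRef` each pool would also need is omitted.

```python
def node_pool(name: str, capacity_type: str, families: list[str], budget: str) -> dict:
    # Mirrors a karpenter.sh/v1 NodePool manifest; check field names against the
    # Karpenter version actually installed before using anything like this.
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "template": {"spec": {"requirements": [
                {"key": "karpenter.sh/capacity-type", "operator": "In",
                 "values": [capacity_type]},
                {"key": "karpenter.k8s.aws/instance-family", "operator": "In",
                 "values": families},
            ]}},
            # How much voluntary disruption (consolidation, drift) the pool tolerates.
            "disruption": {"budgets": [{"nodes": budget}]},
        },
    }


# Training: spot, large GPU instances, interruption-tolerant, so churn is acceptable.
training_pool = node_pool("training-spot", "spot", ["p5", "p4d"], "20%")
# Inference: on-demand and latency-sensitive, so no voluntary disruption at all.
inference_pool = node_pool("inference-ondemand", "on-demand", ["g6", "g5"], "0")
```

The contrast that matters is the capacity type and the disruption budget: the training pool accepts voluntary churn, the inference pool does not.
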
03. Added checkpoint-on-eviction for training jobs with a tested recovery flow, validated by deliberately killing nodes during a real run.

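A sketch of a checkpoint-on-eviction flow of that shape, assuming spot reclaim reaches the training process as SIGTERM through the pod's termination grace period. `save_checkpoint`, `load_checkpoint`, and `step_fn` are placeholders for the job's real checkpoint and training-step code, and the checkpoint path is an assumption.

```python
import os
import signal
import sys

CHECKPOINT_DIR = "/mnt/checkpoints/run-001"   # assumed shared checkpoint location
_evicted = False


def _on_sigterm(signum, frame):
    # Spot reclaim reaches the pod as SIGTERM; just flag it so the training loop
    # can checkpoint at a safe step boundary instead of dying mid-update.
    global _evicted
    _evicted = True


signal.signal(signal.SIGTERM, _on_sigterm)


def train(data_loader, step_fn, save_checkpoint, load_checkpoint):
    # Resume from the last checkpoint if a previous attempt was interrupted.
    # (Fast-forwarding the data loader to the resumed step is elided here.)
    start = load_checkpoint(CHECKPOINT_DIR) if os.path.isdir(CHECKPOINT_DIR) else 0
    for step, batch in enumerate(data_loader, start=start):
        step_fn(batch)
        if _evicted or (step > 0 and step % 500 == 0):
            save_checkpoint(CHECKPOINT_DIR, step)
            if _evicted:
                sys.exit(0)  # exit cleanly; the job controller reschedules the run
```
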
04. Switched the inference autoscaler signal from CPU utilisation to in-flight request count plus GPU memory headroom, with conservative scale-down settings to protect TTFT.

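The case study doesn't name the autoscaler, so the sketch below only shows the arithmetic of that replacement signal: replicas driven by in-flight requests per replica, bumped when GPU memory headroom disappears, and scale-down held back by a stabilisation window. The target of 8 in-flight requests, the 10% headroom floor, and the 10-minute window are illustrative values, not the platform's tuning.

```python
import math
import time
from collections import deque

TARGET_INFLIGHT_PER_REPLICA = 8   # illustrative target, not the platform's real number
MIN_GPU_MEM_HEADROOM = 0.10       # keep at least 10% GPU memory free per replica
SCALE_DOWN_WINDOW_S = 600         # conservative scale-down: 10-minute stabilisation window

_recommendations: deque = deque()  # (timestamp, desired) pairs inside the window


def desired_replicas(in_flight: int, gpu_mem_used_frac: float, current: int) -> int:
    """Replica count from in-flight requests, gated by GPU memory headroom."""
    want = max(1, math.ceil(in_flight / TARGET_INFLIGHT_PER_REPLICA))
    # If memory headroom is gone, request count alone is understating pressure.
    if gpu_mem_used_frac > 1 - MIN_GPU_MEM_HEADROOM:
        want = max(want, current + 1)

    now = time.time()
    _recommendations.append((now, want))
    while _recommendations and _recommendations[0][0] < now - SCALE_DOWN_WINDOW_S:
        _recommendations.popleft()

    if want >= current:
        return want  # scale up immediately to protect time-to-first-token
    # Only scale down as far as the highest recommendation seen in the window.
    return min(current, max(d for _, d in _recommendations))
```
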
05. Negotiated a GPU committed-use discount sized against the post-tuning baseline, not the pre-tuning bloat, locking in the savings the same week as the optimisation work.

Result

The bill dropped by close to half within the first full billing cycle after tuning. Research velocity went up, not down — fewer queue waits, faster checkpoint recovery, and an inference path with predictable latency. The platform team got time back to work on multi-tenant scheduling.