Rebuilding the platform under a payments company without slowing the roadmap
Cut deploy time from 38 minutes to under 9, reduced cluster spend by 31%, and got the team out of a quarterly upgrade panic.
- 38m → 9m
- Mainline build time
- -31%
- Monthly cluster spend
- -22%
- p95 checkout latency
- 47 → 6
- On-call pages / week
Two EKS clusters that had drifted from each other, Helm releases applied from laptops, no error budget, and a CI pipeline so slow that engineers batched merges to avoid waiting. Auditors had questions the team couldn't answer in less than a week.
- 01
Audited the current state across infrastructure, CI/CD, observability, and on-call. Wrote a single-page assessment with the three things that mattered and the eleven that didn't.
- 02
Replaced Helm-from-laptop with ArgoCD app-of-apps, pinned cluster versions, and introduced a real promotion path from staging to two production regions via PR.
- 03
Rebuilt the GitHub Actions pipeline around a layered cache strategy and selective test execution. Killed three abandoned scanners, kept the four that actually blocked merges.
- 04
Defined SLOs per user journey (checkout, payout, dispute) and wired Prometheus burn-rate alerts. Deleted 70% of existing alerts on day one with the on-call team's blessing.
- 05
Introduced Karpenter with consolidated workload classes, right-sized stateful tier, and structured a savings plan against the post-tuning baseline rather than the bloated one.
Three months in, deploys had moved from late evenings to mid-afternoon and the team shipped a multi-region failover drill on schedule. The platform passed its first PCI re-audit without a single open finding, and the on-call rotation became something engineers stopped trying to swap out of.
Other engagements.
Standing up a platform team where there wasn't one
Delivered a working internal developer platform, paved-path service template, and hired the two engineers who own it now.
HIPAA-aligned cloud foundation for a clinical data startup
Cleared HIPAA technical safeguards review with their first enterprise customer's security team — on the first pass.
Getting an e-commerce platform through Black Friday without a war room
Handled 7.2x the previous year's peak with a single sub-five-minute degradation, no all-hands incident, and a smaller bill than the prior year.