Fintech · 14 weeks · Series-C payments platform

Rebuilding the platform under a payments company without slowing the roadmap

Cut deploy time from 38 minutes to under 9, reduced cluster spend by 31%, and got the team out of a quarterly upgrade panic.

Kubernetes · Multi-region · PCI-DSS · GitOps · FinOps
By the numbers
Mainline build time: 38m → 9m
Monthly cluster spend: -31%
p95 checkout latency: -22%
On-call pages / week: 47 → 6

Problem

Two EKS clusters that had drifted from each other, Helm releases applied from laptops, no error budget, and a CI pipeline so slow that engineers batched merges to avoid waiting. Auditors had questions the team couldn't answer in less than a week.

Approach
  1. Audited the current state across infrastructure, CI/CD, observability, and on-call. Wrote a single-page assessment with the three things that mattered and the eleven that didn't.

  2. Replaced Helm-from-laptop with ArgoCD app-of-apps, pinned cluster versions, and introduced a real promotion path from staging to two production regions via PR.

  3. Rebuilt the GitHub Actions pipeline around a layered cache strategy and selective test execution. Killed three abandoned scanners and kept the four that actually blocked merges.

  4. Defined SLOs per user journey (checkout, payout, dispute) and wired Prometheus burn-rate alerts. Deleted 70% of existing alerts on day one with the on-call team's blessing.

  5. Introduced Karpenter with consolidated workload classes, right-sized the stateful tier, and structured a savings plan against the post-tuning baseline rather than the bloated one.
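
To make the burn-rate alerting in step 4 concrete, here is a minimal sketch of the arithmetic behind a multi-window burn-rate page. It is not the engagement's actual rule set. It assumes a 99.9% SLO and the commonly used paging thresholds from Google's SRE Workbook (14.4x over 1 hour confirmed by 5 minutes, 6x over 6 hours confirmed by 30 minutes). The names `BurnRateRule`, `burn_rate`, and `should_page` and the sample error ratios are hypothetical. In practice this logic would live in Prometheus alerting rules rather than in Python.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class BurnRateRule:
    long_window: str   # window that shows the budget is really burning
    short_window: str  # window that shows it is still burning right now
    threshold: float   # burn-rate multiple that triggers a page


# Assumed thresholds (SRE Workbook defaults), not taken from the case study.
PAGE_RULES = (
    BurnRateRule(long_window="1h", short_window="5m", threshold=14.4),
    BurnRateRule(long_window="6h", short_window="30m", threshold=6.0),
)


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how many times faster than 'exactly on budget' errors occur."""
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(error_ratios: Mapping[str, float], slo_target: float) -> bool:
    """Page only if both the long and the short window exceed a rule's threshold."""
    for rule in PAGE_RULES:
        long_burn = burn_rate(error_ratios[rule.long_window], slo_target)
        short_burn = burn_rate(error_ratios[rule.short_window], slo_target)
        if long_burn >= rule.threshold and short_burn >= rule.threshold:
            return True
    return False


# Hypothetical checkout journey: a sustained 2% error ratio against a 99.9% SLO.
sustained = {"5m": 0.02, "30m": 0.02, "1h": 0.02, "6h": 0.005}
# Hypothetical short spike: bad for five minutes, fine over the longer windows.
spike = {"5m": 0.05, "30m": 0.01, "1h": 0.005, "6h": 0.001}

print(should_page(sustained, slo_target=0.999))
print(should_page(spike, slo_target=0.999))
```

Requiring both windows to agree is what allows most threshold-style alerts to be deleted: a brief spike that has already recovered does not page, while a sustained burn against the error budget does.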

Result

Three months in, deploys had moved from late evenings to mid-afternoon and the team shipped a multi-region failover drill on schedule. The platform passed its first PCI re-audit without a single open finding, and the on-call rotation became something engineers stopped trying to swap out of.