E-commerce · 6 weeks (peak readiness) + ongoing retainer · Mid-market DTC e-commerce platform

Getting an e-commerce platform through Black Friday without a war room

Handled 7.2x the previous year's peak RPS with a single sub-five-minute degradation, no all-hands incident, and a peak-week cloud bill 18% lower than the prior year's.

Capacity planning · Load testing · Autoscaling · Incident response
By the numbers

Peak RPS handled: 7.2x YoY
Sev-1 incidents during peak: 0
Total peak-week cloud spend: -18% YoY
p99 cart latency at peak: <340 ms

Problem

Black Friday 2024 had been a 36-hour war room with three near-misses. Leadership wanted 2025 to be boring, and the team had three different opinions on what had actually saved them last time.

Approach
  1. Reconstructed the 2024 incident timeline from logs and pages, then ran a blameless retro to separate what worked from what we got lucky on.

  2. Built a representative load test with k6 that drove real user journeys, not synthetic RPS. Calibrated it against last year's traffic shape.

  3. Tuned Karpenter consolidation and per-service HPA target utilisation, and pre-warmed the stateful tier ahead of the campaign window. Documented why each number was the number.

  4. Replaced threshold alerts on CPU with burn-rate alerts on the four user journeys that mattered. The on-call team agreed up front what would and wouldn't get them out of bed.

  5. Ran two full-scale game days in production (with marketing's blessing), four and two weeks before peak. Found and fixed a Redis connection storm both times.
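
To make the burn-rate idea in step 4 concrete, here is a minimal Python sketch of how such an alert can be evaluated for one user journey. It is an illustration under stated assumptions, not the configuration used in this engagement: the 99.9% SLO target, the 14.4 threshold (a commonly cited figure for spending 2% of a 30-day error budget in one hour), the long-window plus short-window check, and every name in the code are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts for one user journey over one lookback window."""
    total: int
    failed: int


def burn_rate(window: Window, slo_target: float) -> float:
    """Return how fast the error budget is being spent.

    1.0 means the journey is failing at exactly the rate the SLO allows;
    higher values mean the budget will run out early.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if window.total == 0:
        return 0.0  # no traffic, so no budget is being spent
    error_rate = window.failed / window.total
    error_budget = 1.0 - slo_target
    return error_rate / error_budget


def should_page(long_w: Window, short_w: Window,
                slo_target: float, threshold: float) -> bool:
    """Page only when both windows burn faster than the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening, so a resolved spike does not page.
    """
    return (burn_rate(long_w, slo_target) >= threshold
            and burn_rate(short_w, slo_target) >= threshold)


if __name__ == "__main__":
    # Assumed values for illustration only.
    SLO_TARGET = 0.999   # hypothetical checkout-journey SLO
    THRESHOLD = 14.4     # 2% of a 30-day budget spent in one hour

    last_hour = Window(total=100_000, failed=2_000)
    last_five_min = Window(total=10_000, failed=300)
    print(should_page(last_hour, last_five_min, SLO_TARGET, THRESHOLD))
```

Unlike a CPU threshold, this check only fires when users on the journey are actually seeing failures at a rate that threatens the SLO, which is what lets an on-call team agree in advance on what is worth a page.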

Result

Peak weekend ran on autopilot. The incident channel had two messages, both informational. The CTO kept the same playbook for 2026 with marginal tweaks, and the team ran their first holiday on-call rotation that didn't burn anyone out.