Marketplaces · 16 weeks · Series-D logistics marketplace

Real multi-region failover for a two-sided marketplace

Delivered a tested cross-region failover with a documented RTO of 12 minutes and an RPO under 30 seconds, and proved it in a live game day.

Multi-region · Disaster recovery · PostgreSQL · Active-active
By the numbers

- Tested RTO: 12 minutes
- Measured RPO: < 30 seconds
- Failover game days: 4 (all successful)
- Cross-region data sync lag (p99): < 2.1 s
Problem

The company's 'multi-region' setup was a passive copy that nobody had ever cut over to. Leadership had been telling enterprise customers it existed; the engineering team knew it didn't.

Approach
  1. Started with the truth: wrote a one-page document describing the actual state of cross-region readiness and shared it with leadership before promising anything.

  2. Built logical replication for the Postgres tier using pglogical, with monitored lag, alerts on drift, and a documented promotion procedure.

  3. Made the application stateless by default at the request layer, with idempotency keys on writes that crossed regions and explicit conflict policies on the few writes that could conflict.

  4. Wired up Route 53 health-checked failover with sane TTLs, and rehearsed DNS propagation with a partner CDN to understand real-world cutover times.

  5. Ran four game days of increasing severity: staging-only, production with synthetic traffic, production with 5% real traffic, and a full production cutover. Wrote a postmortem for each and fixed three real bugs that only the game days surfaced.
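The Route 53 wiring in step 4 pairs a PRIMARY record with a health check, so DNS flips to the secondary region when the check goes unhealthy. A hedged sketch of such a change batch (names, the IP, the 60-second TTL, and the health-check placeholder are all illustrative, not the client's actual values):

```json
{
  "Comment": "Primary failover record: Route 53 serves the SECONDARY set when this health check fails",
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "api.example.com.",
      "Type": "A",
      "SetIdentifier": "primary-us-east-1",
      "Failover": "PRIMARY",
      "TTL": 60,
      "HealthCheckId": "<health-check-id>",
      "ResourceRecords": [{ "Value": "203.0.113.10" }]
    }
  }]
}
```

A matching record with `"Failover": "SECONDARY"` points at the standby region; the TTL bounds how long resolvers cache the old answer, which is why the cutover rehearsals with the partner CDN mattered.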
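The lag monitoring in step 2 boils down to comparing measured replication lag against the RPO target. A minimal sketch of that alerting logic, with hypothetical names and thresholds (the lag figure itself would come from `pg_stat_replication` or pglogical's own status views, which this sketch does not query):

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen here to line up with the < 30 s RPO target.
LAG_WARN_SECONDS = 10.0
LAG_PAGE_SECONDS = 30.0


@dataclass
class ReplicationStatus:
    subscription: str
    lag_seconds: float  # in production: derived from pg_stat_replication


def classify_lag(status: ReplicationStatus) -> str:
    """Map measured replication lag to an alert level."""
    if status.lag_seconds >= LAG_PAGE_SECONDS:
        return "page"  # RPO at risk: wake someone up
    if status.lag_seconds >= LAG_WARN_SECONDS:
        return "warn"  # drifting: ticket and dashboard, not a page
    return "ok"


print(classify_lag(ReplicationStatus("us_east_to_us_west", 1.8)))  # ok
```

The two-tier threshold is the point: drift gets noticed long before the RPO is actually in danger.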
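The idempotency keys in step 3 make duplicate or replayed cross-region writes safe to retry. A minimal in-memory sketch, assuming a hypothetical `IdempotentWriter` API; in production the dedup table would live in the database and replicate along with the data it protects:

```python
import uuid


class IdempotentWriter:
    """Illustrative dedup store: replaying a request returns the
    original result instead of executing the write twice."""

    def __init__(self) -> None:
        self._seen: dict[str, dict] = {}

    def apply(self, idempotency_key: str, payload: dict) -> dict:
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]  # duplicate: no second write
        result = {"status": "written", "payload": payload}
        self._seen[idempotency_key] = result
        return result


writer = IdempotentWriter()
key = str(uuid.uuid4())
first = writer.apply(key, {"order_id": 42, "state": "picked_up"})
second = writer.apply(key, {"order_id": 42, "state": "picked_up"})
assert first is second  # duplicate delivery is a no-op
```

Clients generate the key once per logical operation and reuse it on every retry, so a request that crosses regions twice still produces exactly one write.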

Result

The marketplace now has a failover capability it can actually demonstrate. Two enterprise customers signed contracts that were blocked on it. The team runs a full game day quarterly and treats it as routine, not a project.