
Database Operations

Production database operations: HA topology, backup and restore drills, migration strategy, performance tuning, connection management, and the unsexy work that decides whether outages are recoverable.

PostgreSQL · MySQL · Redis · Kafka · PgBouncer · pgroll · Vitess

Almost every catastrophic outage I've seen at scale traces to the database. Untested backups. Long-running migrations holding locks. Replicas drifting. Connection pools exhausted. The fixes are well known. They just don't get done.

What I deliver

HA & DR — Primary/replica topology, automated failover (Patroni, RDS multi-AZ, Cloud SQL HA), tested DR procedures with documented RTO/RPO. Quarterly restore drills — yes, actually run them.
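
One way the underlying lag monitoring might look, as a minimal sketch: a check against pg_stat_replication on the primary, assuming psycopg2, a monitoring role, and a placeholder DSN and threshold (tune the threshold to your RPO).

```python
import psycopg2

LAG_THRESHOLD_SECONDS = 30  # placeholder; set from your documented RPO

def check_replication(primary_dsn: str):
    """Return warnings about replica state and replay lag on the primary."""
    warnings = []
    with psycopg2.connect(primary_dsn) as conn, conn.cursor() as cur:
        cur.execute("""
            SELECT application_name, state, sync_state,
                   COALESCE(EXTRACT(EPOCH FROM replay_lag), 0) AS lag_s
            FROM pg_stat_replication
        """)
        rows = cur.fetchall()
        if not rows:
            warnings.append("no replicas attached to primary")
        for name, state, sync_state, lag_s in rows:
            if state != "streaming":
                warnings.append(f"{name}: not streaming (state={state})")
            if lag_s > LAG_THRESHOLD_SECONDS:
                warnings.append(f"{name}: replay lag {lag_s:.0f}s")
    return warnings

if __name__ == "__main__":
    # Hypothetical DSN; point this at the current primary.
    for w in check_replication("postgresql://monitor@primary.internal/postgres"):
        print("WARN:", w)
```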

Migration safety — Online schema change framework. PostgreSQL: pgroll or pg-osc; MySQL: gh-ost. Migrations run on PRs with timing estimates. No more 'we ran a migration and the site went down for 40 minutes'.
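
The lock-safety pattern those tools wrap, sketched in Python with psycopg2: a short lock_timeout so DDL never queues behind long transactions, retry with backoff, and concurrent index builds outside a transaction. Table, column, and DSN names are illustrative only.

```python
import time
import psycopg2
from psycopg2 import errors

DSN = "postgresql://migrator@db.internal/app"  # placeholder

def run_ddl_with_retry(dsn: str, ddl: str, attempts: int = 5) -> None:
    for attempt in range(1, attempts + 1):
        conn = psycopg2.connect(dsn)
        conn.autocommit = True  # CREATE INDEX CONCURRENTLY cannot run in a transaction
        try:
            with conn.cursor() as cur:
                cur.execute("SET lock_timeout = '2s'")        # fail fast instead of blocking traffic
                cur.execute("SET statement_timeout = '15min'")
                cur.execute(ddl)
            return
        except errors.LockNotAvailable:
            wait = 2 ** attempt
            print(f"lock not acquired (attempt {attempt}), retrying in {wait}s")
            time.sleep(wait)
        finally:
            conn.close()
    raise RuntimeError(f"gave up on: {ddl}")

# Expand phase: nullable column plus concurrent index, both safe under load.
run_ddl_with_retry(DSN, "ALTER TABLE orders ADD COLUMN shipped_at timestamptz")
run_ddl_with_retry(DSN, "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_shipped_at ON orders (shipped_at)")
```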

Connection management — PgBouncer / ProxySQL configured with sane pool sizes. Connection storms from autoscaling pods diagnosed and fixed.
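
A sketch of the audit query behind that diagnosis: who is holding connections, in what state, against what headroom. Assumes psycopg2 and a role allowed to read pg_stat_activity; the DSN is a placeholder.

```python
import psycopg2

def connection_summary(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SHOW max_connections")
        max_conn = int(cur.fetchone()[0])
        cur.execute("""
            SELECT application_name, state, count(*)
            FROM pg_stat_activity
            WHERE backend_type = 'client backend'
            GROUP BY 1, 2
            ORDER BY 3 DESC
        """)
        rows = cur.fetchall()
        total = sum(n for _, _, n in rows)
        print(f"{total}/{max_conn} connections in use")
        for app, state, n in rows:
            # A pile of 'idle in transaction' usually means a pool or ORM bug,
            # not a capacity problem.
            print(f"{n:5d}  {state or 'unknown':22s}  {app or '(no app name)'}")

connection_summary("postgresql://monitor@db.internal/postgres")  # hypothetical DSN
```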

Performance baseline — Slow query log review cadence, index health monitoring, query plan regression detection. Top 20 queries by time profiled monthly.
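
The monthly top-20 profile can be as simple as a pg_stat_statements pull, sketched below. It assumes the extension is installed and uses PostgreSQL 13+ column names (total_exec_time, mean_exec_time); older versions call these total_time and mean_time.

```python
import psycopg2

TOP_N = 20

def top_queries(dsn: str):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("""
            SELECT calls,
                   round(total_exec_time::numeric, 1) AS total_ms,
                   round(mean_exec_time::numeric, 2)  AS mean_ms,
                   left(query, 80)                    AS query
            FROM pg_stat_statements
            ORDER BY total_exec_time DESC
            LIMIT %s
        """, (TOP_N,))
        return cur.fetchall()

# Hypothetical DSN; run against each production database on the review cadence.
for calls, total_ms, mean_ms, query in top_queries("postgresql://monitor@db.internal/app"):
    print(f"{total_ms:>12} ms total  {calls:>9} calls  {mean_ms:>8} ms/call  {query}")
```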

Backups that restore — Documented backup tier, retention policy, encryption, off-account replication, restore time tested. The backup that hasn't been restored doesn't exist.
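
A minimal restore-drill sketch, assuming a custom-format pg_dump archive and a scratch database that can be dropped afterwards. The dump path, DSN, and the table used for the freshness check are placeholders; the point is that the drill measures restore time and confirms the data is recent, not just that the command exits zero.

```python
import subprocess
import time
import psycopg2

DUMP_PATH = "/backups/app_latest.dump"                            # placeholder
SCRATCH_DSN = "postgresql://restore@scratch.internal/restore_drill"  # placeholder

def restore_and_verify() -> float:
    start = time.monotonic()
    subprocess.run(
        ["pg_restore", "--no-owner", "--jobs=4",
         f"--dbname={SCRATCH_DSN}", DUMP_PATH],
        check=True,
    )
    elapsed = time.monotonic() - start
    with psycopg2.connect(SCRATCH_DSN) as conn, conn.cursor() as cur:
        # Smoke check: the restored data should be fresh, not a stale copy.
        cur.execute("SELECT now() - max(created_at) FROM orders")  # 'orders' is illustrative
        staleness = cur.fetchone()[0]
        print(f"restore took {elapsed/60:.1f} min, newest row is {staleness} old")
    return elapsed

restore_and_verify()
```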

When this matters most

Pre-Series-B teams running a single Postgres and growing fast, post-incident teams that just lost data and never want it to happen again, or platforms operating multi-tenant databases where the blast radius of one bad query is every customer.