
Database Operations

Production database operations: HA topology, backup and restore drills, migration strategy, performance tuning, connection management, and the unsexy work that decides whether outages are recoverable.

PostgreSQL · MySQL · Redis · Kafka · PgBouncer · pgroll · Vitess

Almost every catastrophic outage I've seen at scale traces to the database. Untested backups. Long-running migrations holding locks. Replicas drifting. Connection pools exhausted. The fixes are well known. They just don't get done.

What I deliver

HA & DR — Primary/replica topology, automated failover (Patroni, RDS multi-AZ, Cloud SQL HA), tested DR procedures with documented RTO/RPO. Quarterly restore drills — yes, actually run them.
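
One way the underlying lag monitoring might look, as a minimal sketch: a check against pg_stat_replication on the primary, assuming psycopg2, a monitoring role, and a placeholder DSN and threshold (tune the threshold to your RPO).

```python
import psycopg2

LAG_THRESHOLD_SECONDS = 30  # placeholder; set from your documented RPO

def check_replication(primary_dsn: str):
    """Return warnings about replica state and replay lag on the primary."""
    warnings = []
    with psycopg2.connect(primary_dsn) as conn, conn.cursor() as cur:
        cur.execute("""
            SELECT application_name, state, sync_state,
                   COALESCE(EXTRACT(EPOCH FROM replay_lag), 0) AS lag_s
            FROM pg_stat_replication
        """)
        rows = cur.fetchall()
        if not rows:
            warnings.append("no replicas attached to primary")
        for name, state, sync_state, lag_s in rows:
            if state != "streaming":
                warnings.append(f"{name}: not streaming (state={state})")
            if lag_s > LAG_THRESHOLD_SECONDS:
                warnings.append(f"{name}: replay lag {lag_s:.0f}s")
    return warnings

if __name__ == "__main__":
    # Hypothetical DSN; point this at the current primary.
    for w in check_replication("postgresql://monitor@primary.internal/postgres"):
        print("WARN:", w)
```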

Migration safety — Online schema change framework. PostgreSQL: pgroll or pg-osc; MySQL: gh-ost. Migrations run on PRs with timing estimates. No more 'we ran a migration and the site went down for 40 minutes'.
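
The lock-safety pattern those tools wrap, sketched in Python with psycopg2: a short lock_timeout so DDL never queues behind long transactions, retry with backoff, and concurrent index builds outside a transaction. Table, column, and DSN names are illustrative only.

```python
import time
import psycopg2
from psycopg2 import errors

DSN = "postgresql://migrator@db.internal/app"  # placeholder

def run_ddl_with_retry(dsn: str, ddl: str, attempts: int = 5) -> None:
    for attempt in range(1, attempts + 1):
        conn = psycopg2.connect(dsn)
        conn.autocommit = True  # CREATE INDEX CONCURRENTLY cannot run in a transaction
        try:
            with conn.cursor() as cur:
                cur.execute("SET lock_timeout = '2s'")        # fail fast instead of blocking traffic
                cur.execute("SET statement_timeout = '15min'")
                cur.execute(ddl)
            return
        except errors.LockNotAvailable:
            wait = 2 ** attempt
            print(f"lock not acquired (attempt {attempt}), retrying in {wait}s")
            time.sleep(wait)
        finally:
            conn.close()
    raise RuntimeError(f"gave up on: {ddl}")

# Expand phase: nullable column plus concurrent index, both safe under load.
run_ddl_with_retry(DSN, "ALTER TABLE orders ADD COLUMN shipped_at timestamptz")
run_ddl_with_retry(DSN, "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_shipped_at ON orders (shipped_at)")
```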

Connection management — PgBouncer / ProxySQL configured with sane pool sizes. Connection storms from autoscaling pods diagnosed and fixed.
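
A sketch of the audit query behind that diagnosis: who is holding connections, in what state, against what headroom. Assumes psycopg2 and a role allowed to read pg_stat_activity; the DSN is a placeholder.

```python
import psycopg2

def connection_summary(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SHOW max_connections")
        max_conn = int(cur.fetchone()[0])
        cur.execute("""
            SELECT application_name, state, count(*)
            FROM pg_stat_activity
            WHERE backend_type = 'client backend'
            GROUP BY 1, 2
            ORDER BY 3 DESC
        """)
        rows = cur.fetchall()
        total = sum(n for _, _, n in rows)
        print(f"{total}/{max_conn} connections in use")
        for app, state, n in rows:
            # A pile of 'idle in transaction' usually means a pool or ORM bug,
            # not a capacity problem.
            print(f"{n:5d}  {state or 'unknown':22s}  {app or '(no app name)'}")

connection_summary("postgresql://monitor@db.internal/postgres")  # hypothetical DSN
```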

Performance baseline — Slow query log review cadence, index health monitoring, query plan regression detection. Top 20 queries by time profiled monthly.
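
The monthly top-20 profile can be as simple as a pg_stat_statements pull, sketched below. It assumes the extension is installed and uses PostgreSQL 13+ column names (total_exec_time, mean_exec_time); older versions call these total_time and mean_time.

```python
import psycopg2

TOP_N = 20

def top_queries(dsn: str):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("""
            SELECT calls,
                   round(total_exec_time::numeric, 1) AS total_ms,
                   round(mean_exec_time::numeric, 2)  AS mean_ms,
                   left(query, 80)                    AS query
            FROM pg_stat_statements
            ORDER BY total_exec_time DESC
            LIMIT %s
        """, (TOP_N,))
        return cur.fetchall()

# Hypothetical DSN; run against each production database on the review cadence.
for calls, total_ms, mean_ms, query in top_queries("postgresql://monitor@db.internal/app"):
    print(f"{total_ms:>12} ms total  {calls:>9} calls  {mean_ms:>8} ms/call  {query}")
```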

Backups that restore — Documented backup tier, retention policy, encryption, off-account replication, restore time tested. The backup that hasn't been restored doesn't exist.
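
A minimal restore-drill sketch, assuming a custom-format pg_dump archive and a scratch database that can be dropped afterwards. The dump path, DSN, and the table used for the freshness check are placeholders; the point is that the drill measures restore time and confirms the data is recent, not just that the command exits zero.

```python
import subprocess
import time
import psycopg2

DUMP_PATH = "/backups/app_latest.dump"                            # placeholder
SCRATCH_DSN = "postgresql://restore@scratch.internal/restore_drill"  # placeholder

def restore_and_verify() -> float:
    start = time.monotonic()
    subprocess.run(
        ["pg_restore", "--no-owner", "--jobs=4",
         f"--dbname={SCRATCH_DSN}", DUMP_PATH],
        check=True,
    )
    elapsed = time.monotonic() - start
    with psycopg2.connect(SCRATCH_DSN) as conn, conn.cursor() as cur:
        # Smoke check: the restored data should be fresh, not a stale copy.
        cur.execute("SELECT now() - max(created_at) FROM orders")  # 'orders' is illustrative
        staleness = cur.fetchone()[0]
        print(f"restore took {elapsed/60:.1f} min, newest row is {staleness} old")
    return elapsed

restore_and_verify()
```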

When this matters most

Pre-Series-B teams running a single Postgres and growing fast, post-incident teams that just lost data and never want it to happen again, or platforms operating multi-tenant databases where the blast radius of one bad query is every customer.