Site Reliability & Observability
SLOs, error budgets, alerting that pages people only when it matters, distributed tracing, log strategy, incident response, and postmortems that change the system.
The job of observability isn't to collect metrics. It's to answer questions: is the system healthy, is it degrading, where, why, what changed. Most stacks I encounter are full of dashboards nobody opens and alerts nobody trusts.
What I rebuild
SLOs that mean something — Defined per user journey ('checkout completes in <800ms p95') not per metric. Error budget burn alerts replace 90% of threshold pages. The on-call rotation becomes 1–2 actionable pages per shift, not 30.
Alert hygiene — Audit existing alerts; delete what nobody acts on; tune what's noisy; add what's missing. Alert routing per team and severity. Quiet hours on warning-tier.
Distributed tracing — OpenTelemetry instrumentation, head-based and tail-based sampling depending on volume, trace-driven debugging culture. The 'why is this request slow' question goes from a 2-hour investigation to a 5-minute span lookup.
Logs as a product — Structured logging standard, correlation IDs everywhere, retention by tier, query cost visibility. Stop paying $40k/month for logs no human ever reads.
Incident response — Defined severities, paging matrix, comms templates, blameless postmortems, action items tracked to closure. Incidents become learning artifacts.
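To make the error-budget idea above concrete, here is a minimal sketch of a multi-window burn-rate check, assuming a journey-level SLO such as "95% of checkouts complete in under 800 ms". The names `Window`, `burn_rate`, and `should_page` are hypothetical, and the 14.4 threshold is an assumption borrowed from the commonly cited fast-burn setting for a 30-day SLO window (2% of the budget consumed in one hour), not a value taken from any particular engagement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts observed over one lookback window (hypothetical shape)."""
    total: int  # all requests for the user journey in the window
    bad: int    # requests that failed or exceeded the latency threshold


def burn_rate(window: Window, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if window.total == 0:
        return 0.0  # no traffic, nothing burned
    error_rate = window.bad / window.total
    budget = 1.0 - slo_target  # allowed fraction of bad requests
    return error_rate / budget


def should_page(long_w: Window, short_w: Window, slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both the long and the short window burn fast.

    The long window shows the burn is significant; the short window shows
    it is still happening, so a problem that already recovered stays quiet.
    """
    return (burn_rate(long_w, slo_target) >= threshold
            and burn_rate(short_w, slo_target) >= threshold)


if __name__ == "__main__":
    slo = 0.95  # assumed target: 95% of checkouts under 800 ms
    one_hour = Window(total=10_000, bad=8_000)
    five_min = Window(total=1_000, bad=900)
    print(f"1h burn rate: {burn_rate(one_hour, slo):.1f}")
    print(f"page on-call: {should_page(one_hour, five_min, slo)}")
```

Requiring both windows to exceed the threshold is what keeps the page count low: a short spike that never threatens the budget, or a burn that has already stopped, does not wake anyone up.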
Where engagements start
Either an alert audit (typically 2 weeks, immediate quality-of-life win for on-call) or an SLO-first reset of how the team thinks about reliability. Both end with measurable outcomes.
Adjacent services
Cloud & DevOps Engineering
Production cloud environments designed deliberately — resilient, cost-aware, and ready for the day you actually need them.
Platform Engineering (internal developer platforms)
Self-service platforms that turn 'open a ticket and wait three days' into 'open a PR and ship in fifteen minutes'.
Kubernetes & Container Orchestration (EKS · GKE · AKS · self-hosted)
Production-grade Kubernetes — clusters that scale, upgrade cleanly, and don't wake people up.