SLOs · alerting · incident response

Site Reliability & Observability

SLOs, error budgets, alerting that pages people only when it matters, distributed tracing, log strategy, incident response, and postmortems that change the system.

Prometheus · Grafana · Datadog · Honeycomb · OpenTelemetry · Loki · Tempo

The job of observability isn't to collect metrics. It's to answer questions: is the system healthy, is it degrading, where, why, what changed. Most stacks I encounter are full of dashboards nobody opens and alerts nobody trusts.

What I rebuild

SLOs that mean something — Defined per user journey ('checkout completes in <800ms p95') not per metric. Error budget burn alerts replace 90% of threshold pages. The on-call rotation becomes 1–2 actionable pages per shift, not 30.
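
For Prometheus-based stacks, the burn-rate alert behind that shift usually looks like the sketch below. It assumes a 95% "checkout under 800ms" objective over 30 days, a request-duration histogram that has an 0.8s bucket, and the multi-window thresholds from the Google SRE Workbook; the metric names, labels, and service name are illustrative.

```yaml
# Sketch: fast-burn page for a latency SLO (95% of checkout requests < 800ms).
# Assumes the histogram exposes a 0.8s bucket; adjust names to your metrics.
groups:
  - name: checkout-slo
    rules:
      - record: checkout:slo_error_ratio:rate5m
        expr: |
          1 - (
            sum(rate(http_request_duration_seconds_bucket{service="checkout", le="0.8"}[5m]))
            /
            sum(rate(http_request_duration_seconds_count{service="checkout"}[5m]))
          )
      - record: checkout:slo_error_ratio:rate1h
        expr: |
          1 - (
            sum(rate(http_request_duration_seconds_bucket{service="checkout", le="0.8"}[1h]))
            /
            sum(rate(http_request_duration_seconds_count{service="checkout"}[1h]))
          )
      - alert: CheckoutSLOBudgetFastBurn
        # 14.4x the budget burn rate over 1h, confirmed by the 5m window,
        # consumes ~2% of a 30-day budget per hour: the one latency condition
        # here that deserves a page. 0.05 is the error budget for a 95% target;
        # use 0.001 for a 99.9% target.
        expr: >
          checkout:slo_error_ratio:rate1h > 14.4 * 0.05
          and
          checkout:slo_error_ratio:rate5m > 14.4 * 0.05
        labels:
          severity: page
```

A slower pair (6x burn over 6h) at ticket severity usually covers the rest; anything milder belongs on a dashboard, not in a pager.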

Alert hygiene — Audit existing alerts; delete what nobody acts on; tune what's noisy; add what's missing. Route alerts by team and severity. Quiet hours for the warning tier.
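
On Prometheus stacks the routing side of this lives in Alertmanager. A minimal sketch (Alertmanager 0.24 or later), with the team name, hours, and receiver integrations left as placeholders:

```yaml
route:
  receiver: catch-all
  group_by: [alertname, team]
  routes:
    - matchers: ['team="payments"']
      routes:
        - matchers: ['severity="page"']
          receiver: payments-pagerduty
        - matchers: ['severity="warning"']
          receiver: payments-slack
          # Warning-tier alerts stay quiet overnight; pages are never muted.
          mute_time_intervals: [overnight]

time_intervals:
  - name: overnight
    time_intervals:
      - times:
          - start_time: "21:00"
            end_time: "08:00"

receivers:
  # Integration settings (PagerDuty keys, Slack webhooks) omitted here.
  - name: catch-all
  - name: payments-pagerduty
  - name: payments-slack
```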

Distributed tracing — OpenTelemetry instrumentation, head-based or tail-based sampling depending on volume, and a trace-driven debugging culture. The 'why is this request slow' question goes from a 2-hour investigation to a 5-minute span lookup.
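
Tail-based sampling is usually the unfamiliar part. Here is a sketch of an OpenTelemetry Collector (contrib distribution) pipeline that keeps every error trace and every slow trace plus a small random baseline; the Tempo endpoint and the 800ms threshold are placeholders:

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}

processors:
  tail_sampling:
    decision_wait: 10s        # hold spans until the trace is complete enough to judge
    policies:
      - name: keep-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: keep-slow
        type: latency
        latency: { threshold_ms: 800 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true          # placeholder; use real TLS outside a demo

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp/tempo]
```

Head-based sampling in the SDK is cheaper and fine at low volume; tail sampling earns its keep once retaining 100% of traces gets expensive.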

Logs as a product — Structured logging standard, correlation IDs everywhere, retention by tier, query cost visibility. Stop paying $40k/month for logs no human ever reads.
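
What 'correlation IDs everywhere' looks like in practice, sketched here in Go with the standard library's log/slog; the header and field names are conventions, not requirements:

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"log/slog"
	"net/http"
	"os"
)

type loggerKey struct{}

// loggerFrom returns the request-scoped logger, or the default if none was set.
func loggerFrom(ctx context.Context) *slog.Logger {
	if l, ok := ctx.Value(loggerKey{}).(*slog.Logger); ok {
		return l
	}
	return slog.Default()
}

// withCorrelation accepts an inbound correlation ID (or mints one) and stores a
// logger carrying it in the request context, so every downstream log line and
// outbound call can share the same ID.
func withCorrelation(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Correlation-ID")
		if id == "" {
			buf := make([]byte, 16)
			_, _ = rand.Read(buf)
			id = hex.EncodeToString(buf)
		}
		w.Header().Set("X-Correlation-ID", id)
		logger := slog.Default().With("correlation_id", id, "method", r.Method, "path", r.URL.Path)
		next.ServeHTTP(w, r.WithContext(context.WithValue(r.Context(), loggerKey{}, logger)))
	})
}

func main() {
	// JSON output: one structured event per line, queryable in Loki or Datadog
	// without custom parsing rules.
	slog.SetDefault(slog.New(slog.NewJSONHandler(os.Stdout, nil)))

	mux := http.NewServeMux()
	mux.HandleFunc("/checkout", func(w http.ResponseWriter, r *http.Request) {
		loggerFrom(r.Context()).Info("checkout started", "cart_items", 3)
		w.WriteHeader(http.StatusOK)
	})

	slog.Info("listening", "addr", ":8080")
	_ = http.ListenAndServe(":8080", withCorrelation(mux))
}
```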

Incident response — Defined severities, paging matrix, comms templates, blameless postmortems, action items tracked to closure. Incidents become learning artifacts.

Where engagements start

Either an alert audit (typically 2 weeks, with an immediate quality-of-life win for on-call) or an SLO-first reset of how the team thinks about reliability. Both end with measurable outcomes.