Stop Setting SLOs on Endpoints. Set Them on Journeys.
How to design SLOs that measure what customers actually feel, not what the dashboard makes easy.
Most SLOs I review are useless. Not wrong. Useless. They are bound to HTTP endpoints because that is what the dashboard makes easy, they alert on things customers do not feel, and they go green during outages that cost real money.
The fix is not better thresholds. It is moving the level of abstraction up.
The endpoint SLO trap
The default pattern in every monitoring tool: pick a service, pick its main endpoints, set a 99.9% availability SLO and a p99 latency SLO. Done. Page the on-call when the budget burns.
This produces a lot of theatre and not a lot of insight. Three reasons:
- A 99.9% SLO on `POST /orders` does not tell you whether anyone successfully placed an order today. The endpoint can be 100% available and the checkout flow still broken because step five depends on an inventory call that nobody scoped.
- Endpoint SLOs hide retries. Clients retry. The endpoint reports 99.95% success because nine out of ten failures get recovered by client retry. The user still feels the latency tax.
- Endpoint SLOs invite gaming. Decompose the endpoint into ten smaller endpoints, each with its own SLO, each green. The user experience is unchanged.
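The retry point is easiest to see with numbers. A toy sketch (invented attempt log, hypothetical operation IDs) of how per-request metrics and user-perceived experience diverge once clients retry:

```python
from collections import defaultdict

# Hypothetical attempt log: (operation_id, success, latency_ms).
# One user operation may span several attempts at the same endpoint.
attempts = [
    ("op1", False, 800), ("op1", True, 750),   # failed once, retry recovered
    ("op2", True, 120),
    ("op3", False, 900), ("op3", False, 850), ("op3", True, 700),
    ("op4", True, 110),
]

# Endpoint view: per-request success rate, the number most dashboards show.
endpoint_success = sum(ok for _, ok, _ in attempts) / len(attempts)

# Journey view: did the operation eventually succeed, and what total
# latency (retries included) did the user actually sit through?
ops = defaultdict(list)
for op, ok, ms in attempts:
    ops[op].append((ok, ms))

journey_success = sum(any(ok for ok, _ in tries) for tries in ops.values()) / len(ops)
journey_latency = {op: sum(ms for _, ms in tries) for op, tries in ops.items()}

print(f"endpoint success: {endpoint_success:.0%}")            # 57%
print(f"journey success:  {journey_success:.0%}")             # 100%
print(f"worst user latency: {max(journey_latency.values())} ms")  # 2450 ms
```

Every journey here eventually succeeds, so a retry-aware availability number looks perfect; the user on `op3` still paid 2.45 seconds for it. Neither per-request success nor eventual success alone captures that tax.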
Journey SLOs
A user journey is the smallest unit of business value. "Place an order." "Reset my password." "Search and click through to a result." "Stream a video for sixty seconds without buffering."
A journey SLO measures whether the journey succeeded, end to end, from the user's frame of reference. It does not care which microservices were involved.
Defining a journey SLO is more work than an endpoint SLO. That is the point. The work is what makes it useful.
How to define one
Pick the journeys. Be ruthless. You probably have five to ten that matter. Anything more and you are not prioritising.
For each, answer four questions:
- What does success mean from outside the system? "The user got a 2xx" is not enough. "The order is durably persisted, the user got a confirmation, and the inventory was decremented" might be.
- Where do you instrument it? Ideally at the edge. A synthetic test that exercises the full journey on a schedule is acceptable. Reading server logs and stitching them together is fragile. Distributed tracing across the journey is the right answer if you have it.
- What is the budget? 99.5%? 99.9%? Higher numbers are not better. The right number is the one where, if you breach it, the business loses real money or trust.
- What latency floor matters? Not the average. The p99 of the journey, end to end, including retries. Users feel tails.
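The answers to those four questions can be captured as data and evaluated mechanically. A minimal sketch, with invented numbers and a hypothetical `JourneySLO` shape; real systems would pull journey outcomes from tracing or RUM rather than a list:

```python
from dataclasses import dataclass

@dataclass
class JourneySLO:
    name: str
    target: float        # e.g. 0.995 -> 99.5% of journeys must succeed
    latency_p99_ms: int  # end-to-end latency floor at p99, retries included

def evaluate(slo, journeys):
    """journeys: list of (succeeded, latency_ms), one entry per journey run."""
    n = len(journeys)
    success_rate = sum(ok for ok, _ in journeys) / n
    latencies = sorted(ms for _, ms in journeys)
    p99 = latencies[min(n - 1, int(0.99 * n))]  # crude nearest-rank percentile
    # Error budget: the number of journeys allowed to fail in the window.
    budget = (1 - slo.target) * n
    burned = sum(not ok for ok, _ in journeys)
    return {
        "success_rate": success_rate,
        "p99_ms": p99,
        "budget_remaining": budget - burned,
        "breached": success_rate < slo.target or p99 > slo.latency_p99_ms,
    }

# Invented window: 1,000 checkout journeys, 8 failures, 99.5% target.
slo = JourneySLO("checkout", target=0.995, latency_p99_ms=8000)
runs = [(True, 2000)] * 992 + [(False, 9000)] * 8
report = evaluate(slo, runs)
print(report)  # 8 failures against a budget of 5 -> breached
```

The budget arithmetic is the part worth internalising: a 99.5% target over 1,000 journeys is a budget of exactly 5 failed journeys, so 8 failures means the budget is gone regardless of what any endpoint dashboard says.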
A worked example
For an e-commerce client, the checkout journey SLO looked like this:
- Definition: from the `/cart/checkout` click to order confirmation page render.
- Success: order persisted, payment captured, confirmation rendered within 8 seconds at p95.
- Budget: 99.5% over a 28-day rolling window.
- Instrumentation: front-end RUM beacon at start and end, joined with a server-side trace ID.
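The instrumentation line above is essentially a join: RUM beacons carry a trace ID minted at the checkout click, and server-side spans for the same trace say whether the order was persisted and the payment captured. A sketch of that join, with invented event shapes and trace IDs:

```python
# Hypothetical events. "start" fires on the checkout click, "confirm_render"
# when the confirmation page paints; timestamps are ms relative to start.
rum_events = [
    {"trace_id": "t1", "phase": "start", "ts": 0},
    {"trace_id": "t1", "phase": "confirm_render", "ts": 5200},
    {"trace_id": "t2", "phase": "start", "ts": 0},  # no confirm -> failed journey
]
server_spans = {
    "t1": {"order_persisted": True, "payment_captured": True},
    "t2": {"order_persisted": True, "payment_captured": False},
}

def journey_outcomes(rum, spans, deadline_ms=8000):
    starts, ends = {}, {}
    for e in rum:
        (starts if e["phase"] == "start" else ends)[e["trace_id"]] = e["ts"]
    out = []
    for tid, t0 in starts.items():
        side = spans.get(tid, {})
        # Success requires ALL of: confirmation rendered in time, order
        # persisted, payment captured -- matching the definition above.
        rendered = tid in ends and ends[tid] - t0 <= deadline_ms
        ok = rendered and side.get("order_persisted") and side.get("payment_captured")
        out.append((tid, bool(ok)))
    return out

print(journey_outcomes(rum_events, server_spans))  # [('t1', True), ('t2', False)]
```

Note that `t2` fails the journey even though every individual request behind it may have returned 2xx; the client simply never reached the confirmation page.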
The first month we ran it, the journey SLO was at 98.2% while every individual endpoint SLO was green. The gap was a payment provider intermittently returning 200 with an empty body. The endpoint logged success. The journey failed. The endpoint SLO had been green for two years through the same bug.
That is the entire point of journey SLOs. They catch the failures the system tells you do not exist.
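The payment-provider bug is a concrete argument for checking success semantically rather than by status code. A hedged sketch of that check; the `payment_id` field is an invented stand-in for whatever proves the payment was actually captured:

```python
import json

def payment_step_succeeded(status_code, body):
    """A step succeeds only if the response *content* proves it did.

    A 200 with an empty or malformed body is a failure, even though a
    status-code-based endpoint metric would count it as a success.
    """
    if status_code != 200 or not body:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    # Invented field name: whatever artifact proves the payment exists.
    return payload.get("payment_id") is not None

print(payment_step_succeeded(200, '{"payment_id": "p_123"}'))  # True
print(payment_step_succeeded(200, ""))                         # False: the bug above
print(payment_step_succeeded(200, '{"status": "pending"}'))    # False
```

With a check like this feeding the journey SLO, a 200-with-empty-body burns budget immediately instead of hiding for two years.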
The pushback you will get
Two objections, both predictable:
- "Journey SLOs are hard to instrument across teams." Yes. That is a feature, not a bug. The work of defining the journey forces the teams that own its parts to agree on what success means. That conversation is more valuable than any dashboard.
- "We will breach the journey SLO and not know which team is at fault." Good. The blameless investigation that follows is more useful than a green endpoint dashboard with a quietly broken product.
What to do next week
Pick one journey. The most revenue-critical one. Define a journey SLO for it using the four questions. Wire up instrumentation, even if it is hacky. Run it for a month.
You will discover at least one persistent failure mode that your endpoint SLOs hid. That is the ROI on the exercise.
After that, do the same for the next four journeys. Stop. Five to ten journey SLOs per product surface is plenty. Anything more and you are back to dashboard theatre.