SLOs and burn rates
An SLO is a promise about how reliable a service is. A burn rate is how fast you're breaking that promise. Burn-rate alerts are how GreenSlope decides to page you.
If you've never run SLOs before, this page is enough to start. If you've run them before, skim for GreenSlope-specific defaults.
SLO: an objective with a time window
A Service Level Objective (SLO) is a target expressed over a time window. Examples:
- 99.9% of HTTP requests to `/checkout` complete in under 300 ms, over the last 28 days.
- 99.95% of background jobs succeed, over the last 7 days.
Two things make an SLO an SLO, not just a threshold:
- It has a window (7, 14, 28 days — 28 is the GreenSlope default).
- It has an error budget: the amount of "not meeting the objective" you can afford before you're officially off-target.
For 99.9% over 28 days, the budget is (1 − 0.999) × 28 days = 40.32 minutes, or about 40 minutes 19 seconds. That's how much total unavailability the service can rack up in the 28-day window and still meet the SLO.
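The budget arithmetic can be sketched in a few lines of Python. This is an illustrative helper, not part of GreenSlope; the function name `error_budget` is an assumption made for the example.

```python
from datetime import timedelta


def error_budget(target: float, window_days: int) -> timedelta:
    """Return the time-based error budget for an SLO target over a window."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be a fraction between 0 and 1, e.g. 0.999")
    # Budget = (1 - target) x window length.
    return timedelta(days=window_days) * (1.0 - target)


budget = error_budget(0.999, 28)
print(budget)  # roughly 40 minutes for 99.9% over 28 days
```

Tightening the target to 99.95% halves the budget, and shortening the window to 7 days cuts it to a quarter, which is why the target and the window always have to be stated together.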
Burn rate: how fast you're spending the budget
If your error budget is 40 minutes over 28 days, a "normal" burn rate is 1× the allowed rate — you'd exhaust the budget exactly at the end of the window.
A burn rate above 1× means you're spending faster than allowed. A burn rate of 14× means you'll exhaust the full 28-day budget in 2 days. That's the alerting signal.
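As a rough sketch, assuming a request-based SLO where the burn rate is the observed error ratio divided by the error ratio the SLO allows, the calculation looks like this. The function names are hypothetical and not part of any GreenSlope API.

```python
def burn_rate(error_ratio: float, target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    allowed = 1.0 - target
    return error_ratio / allowed


def days_to_exhaustion(rate: float, window_days: float) -> float:
    """Days until a full budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")  # nothing is being spent
    return window_days / rate


# 1.4% of requests failing against a 99.9% target burns at about 14x.
rate = burn_rate(error_ratio=0.014, target=0.999)
print(round(rate, 1))
print(days_to_exhaustion(14.0, 28))  # 28 days / 14 = 2.0 days
```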
Multi-window burn-rate alerting (Google's default, ours too)
A single threshold on error rate is a bad alert. Too noisy if tight; too slow to fire if loose. GreenSlope defaults to the multi-window burn-rate approach:
| Alert severity | Short window | Long window | Burn rate |
|---|---|---|---|
| Page (sev-1) | 5 min | 1 h | ≥ 14.4× |
| Page (sev-2) | 30 min | 6 h | ≥ 6× |
| Ticket | 2 h | 24 h | ≥ 3× |
| Ticket | 6 h | 72 h | ≥ 1× |
An alert fires only when both windows cross the threshold simultaneously. The short window catches fast-moving incidents; the long window filters spikes that recover on their own. Together they page when it matters and shut up when it doesn't.
Default SLOs GreenSlope creates
When you add a service, we create two default SLOs you can keep or replace:
- Availability: 99.9% of HTTP requests return a status code below 500, over 28 days.
- Latency: 99% of HTTP requests complete faster than the p95 latency baseline we observed during the service's first 72 hours, over 28 days.
Both use the multi-window burn-rate alert table above. If the defaults don't match your workload, define your own in the dashboard under Services → SLOs.
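To make the availability definition concrete, here is a small sketch that computes the ratio of good requests from a list of status codes. The function name and the sample data are assumptions for illustration, and treating an empty sample as fully available is a choice of this example, not a documented GreenSlope behavior.

```python
from typing import Sequence


def availability(status_codes: Sequence[int]) -> float:
    """Fraction of requests whose HTTP status code is below 500."""
    if not status_codes:
        return 1.0  # assumption: no traffic counts as meeting the objective
    good = sum(1 for code in status_codes if code < 500)
    return good / len(status_codes)


# Invented sample: 1,000 requests, two of them server errors.
codes = [200] * 997 + [404, 500, 503]
sli = availability(codes)
print(sli, sli >= 0.999)
```

Note that the 404 counts as a good request here: the objective is about server-side failures, so client errors do not spend the budget.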
What isn't an SLO
Two things teams commonly call SLOs that aren't:
- A dashboard threshold. "Alert me if error rate > 1%" isn't an SLO. It has no window and no budget. It will page you for flaps.
- A vibes target. "We want to be reliable" isn't an SLO either. Write down the number, the window, and the budget.
If you can't write it down as "N% of X, over Y days", you don't have an SLO yet.
Related