SLOs and burn rates

An SLO is a promise about how reliable a service is. A burn rate is how fast you're breaking that promise. Burn-rate alerts are how GreenSlope decides to page you.

If you've never run SLOs before, this page is enough to start. If you've run them before, skim for GreenSlope-specific defaults.

SLO: an objective with a time window

A Service Level Objective (SLO) is a target expressed over a time window. Examples:

99.9% of HTTP requests to /checkout complete in under 300 ms, over the last 28 days.
99.95% of background jobs succeed, over the last 7 days.

Two things make an SLO an SLO, not just a threshold:

It has a window (7, 14, 28 days — 28 is the GreenSlope default).
It has an error budget: the amount of "not meeting the objective" you can afford before you're officially off-target.

For 99.9% over 28 days, the budget is (1 − 0.999) × 28 days = 40.32 minutes, or about 40 minutes 19 seconds. That's how much total unavailability the service can rack up within the 28-day window and still meet the SLO.
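The budget arithmetic can be sketched in a few lines of Python. This is an illustrative helper, not part of GreenSlope; the function name `error_budget` is an assumption, and it treats the budget as time, as the paragraph above does.

```python
from datetime import timedelta


def error_budget(target: float, window_days: int) -> timedelta:
    """Allowed 'bad' time for an SLO target (as a fraction) over a window."""
    if not 0 < target < 1:
        raise ValueError("target must be a fraction strictly between 0 and 1")
    # Budget = (1 - target) x window length.
    return timedelta(days=window_days) * (1 - target)


budget = error_budget(0.999, 28)
print(f"{budget.total_seconds() / 60:.2f} minutes")  # roughly 40 minutes
```

For a request-based SLO the same formula applies to counts: the budget is (1 − target) × total requests in the window.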

Burn rate: how fast you're spending the budget

If your error budget is 40 minutes over 28 days, a "normal" burn rate is 1× the allowed rate — you'd exhaust the budget exactly at the end of the window.

A burn rate above 1× means you're spending faster than allowed. A burn rate of 14× means you'll exhaust the full 28-day budget in 2 days. That's the alerting signal.
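As a rough sketch of that arithmetic, assume burn rate is the observed error ratio divided by the allowed error ratio (1 − target). The function names below are hypothetical, chosen for illustration only.

```python
def burn_rate(error_ratio: float, target: float) -> float:
    """Observed error ratio relative to the ratio the SLO allows."""
    allowed = 1 - target
    return error_ratio / allowed


def days_to_exhaustion(rate: float, window_days: int) -> float:
    """Days until the whole budget is spent if the burn rate holds steady."""
    if rate <= 0:
        return float("inf")  # nothing is being spent
    return window_days / rate


# 1.4% of requests failing against a 99.9% target is about a 14x burn.
print(round(burn_rate(0.014, 0.999), 1))
# At 14x, a 28-day budget lasts 28 / 14 days.
print(days_to_exhaustion(14, 28))  # 2.0
```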

Multi-window burn-rate alerting (Google's default, ours too)

A single threshold on error rate is a bad alert. Too noisy if tight; too slow to fire if loose. GreenSlope defaults to the multi-window burn-rate approach:

Alert severity	Short window	Long window	Burn rate
Page (sev-1)	5 min	1 h	≥ 14.4×
Page (sev-2)	30 min	6 h	≥ 6×
Ticket	2 h	24 h	≥ 3×
Ticket	6 h	72 h	≥ 1×

An alert fires only when both windows cross the threshold simultaneously. The short window catches fast-moving incidents; the long window filters spikes that recover on their own. Together they page when it matters and shut up when it doesn't.
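The both-windows rule can be sketched as follows. This is not GreenSlope's implementation: the `BurnRule` type, the window labels, and the `firing` function are assumptions made for illustration, with thresholds copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRule:
    severity: str
    short_window: str
    long_window: str
    threshold: float


# Thresholds from the table above; window labels are illustrative keys.
RULES = [
    BurnRule("Page (sev-1)", "5m", "1h", 14.4),
    BurnRule("Page (sev-2)", "30m", "6h", 6.0),
    BurnRule("Ticket", "2h", "24h", 3.0),
    BurnRule("Ticket", "6h", "72h", 1.0),
]


def firing(rates: dict[str, float]) -> list[BurnRule]:
    """Return rules whose short AND long windows both meet the threshold.

    `rates` maps a window label to the burn rate observed over that window.
    A missing window is treated as a burn rate of 0.
    """
    return [
        rule
        for rule in RULES
        if rates.get(rule.short_window, 0.0) >= rule.threshold
        and rates.get(rule.long_window, 0.0) >= rule.threshold
    ]


# A brief spike: the 5-minute window is hot, but the 1-hour window is not.
spike = {"5m": 20.0, "1h": 2.0, "30m": 4.0, "6h": 0.5, "2h": 1.0, "24h": 0.2, "72h": 0.1}
print([r.severity for r in firing(spike)])  # []

# A sustained incident: both the 5-minute and 1-hour windows are hot.
sustained = {"5m": 20.0, "1h": 16.0, "30m": 18.0, "6h": 3.0, "2h": 10.0, "24h": 1.0, "72h": 0.5}
print([r.severity for r in firing(sustained)])  # ['Page (sev-1)']
```

The spike case shows the long window doing its job: a high 5-minute burn rate alone pages no one.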

Default SLOs GreenSlope creates

When you add a service, we create two default SLOs you can keep or replace:

Availability: 99.9% of HTTP requests return a status code below 500 (that is, not a server error), over 28 days.
Latency: 99% of HTTP requests complete faster than the p95 latency baseline we observed during the service's first 72 hours, over 28 days.

Both use the multi-window burn-rate alert table above. If the defaults don't match your workload, define your own in the dashboard under Services → SLOs.
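To make the default availability objective concrete, here is a small sketch of how such an indicator could be computed from raw status codes. The `availability` function and the sample data are invented for illustration; GreenSlope's own calculation may differ.

```python
def availability(status_codes: list[int]) -> float:
    """Fraction of requests whose HTTP status code is below 500."""
    if not status_codes:
        return 1.0  # assumption: no traffic counts as fully available
    good = sum(1 for code in status_codes if code < 500)
    return good / len(status_codes)


# Invented sample: 10,000 requests, 5 of them server errors.
# Client errors (4xx) still count as "good" under this definition.
codes = [200] * 9990 + [404] * 5 + [503] * 5
sli = availability(codes)
print(sli, sli >= 0.999)  # 0.9995 True
```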

What isn't an SLO

Two things teams commonly call SLOs that aren't:

A dashboard threshold. "Alert me if error rate > 1%" isn't an SLO. It has no window and no budget, so it will page you every time the error rate flaps across the line.
A vibes target. "We want to be reliable" isn't an SLO either. Write down the number, the window, and the budget.

If you can't write it down as "N% of X, over Y days", you don't have an SLO yet.
