Prometheus alert threshold tuning: a practical guide

Most Prometheus thresholds were set once, by someone who has since left, using a number that felt right that afternoon. Here's how to choose thresholds and for: durations with actual criteria — and why the two most popular recipes quietly mislead you.

Start from what a threshold actually is

An alert rule is a decision procedure: given this metric's behavior, decide whether the system is broken. Every decision procedure has two error rates — false positives (paging on a healthy system) and false negatives (sleeping through a broken one). Threshold tuning is choosing where you sit on that trade-off.

The mistake almost every team makes is tuning without naming the trade-off. "Set it to 0.8" is not a decision; "we accept roughly five false pages per year from this rule in exchange for catching latency regressions within five minutes" is. You can't get to the second statement by staring at a Grafana panel and squinting. You need to ask, for any candidate threshold: how often will normal behavior cross it?
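
One rough way to put a number on that question is to count how often a stretch of healthy history rises above the candidate threshold and scale that to a year. The sketch below assumes evenly spaced samples with no gaps; the function name and the sample data are hypothetical, and it ignores any for: duration.

```python
def crossings_per_year(samples, threshold, step_seconds):
    """Count upward crossings of `threshold` and scale to a yearly rate.

    samples: metric values at a fixed spacing of `step_seconds`, taken
    from a period you consider healthy (assumption: no gaps).
    """
    episodes = 0
    above = False
    for value in samples:
        now_above = value > threshold
        if now_above and not above:
            episodes += 1  # a new excursion above the threshold begins
        above = now_above
    observed_years = len(samples) * step_seconds / (365 * 24 * 3600)
    return episodes / observed_years


# Hypothetical data: one day of 1-minute samples with two short excursions.
history = [0.5] * 1440
history[100:103] = [0.9, 0.95, 0.9]
history[800:801] = [0.85]
rate = crossings_per_year(history, threshold=0.8, step_seconds=60)
print(f"{rate:.0f} crossings per year at this threshold")
```

A single day is far too little history to trust the extrapolation; the point is only that the question has a computable answer once you pick a threshold and a window of healthy data.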

Mistake #1: the 7-day p95 (or max) as threshold

A common recipe: query the last 7 days, take the p95 (or the max plus some headroom), make that the threshold. It feels empirical. Three problems:

Mistake #2: mean+2σ (the Gaussian folklore)

The other classic: compute mean and standard deviation, set the threshold at μ+2σ (or 3σ), expect the textbook ~2.3% (or 0.13%) exceedance. This imports three assumptions from the textbook that production metrics violate:

None of this means "never use a static threshold". It means the confidence people attach to μ+2σ is unjustified. The number looks scientific; its error rate is unknown.
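
To see how far the real exceedance rate can drift from the textbook figure, you can compare the Gaussian tail probability with what a non-Gaussian sample actually does. This is a minimal sketch: the lognormal data stands in for a right-skewed, latency-like metric and is an assumption for illustration, not a claim about any particular system.

```python
import random
import statistics


def exceedance_rate(samples, k=2.0):
    """Fraction of samples above mean + k * stdev of the same samples."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    cutoff = mu + k * sigma
    return sum(1 for x in samples if x > cutoff) / len(samples)


# Textbook one-sided tail for a true Gaussian at 2 sigma (about 0.0228).
gaussian_tail = 1 - statistics.NormalDist().cdf(2.0)

# Hypothetical latency-like data: lognormal, i.e. right-skewed.
rng = random.Random(42)
skewed = [rng.lognormvariate(0.0, 1.0) for _ in range(50_000)]
observed = exceedance_rate(skewed, k=2.0)

print(f"Gaussian assumption: {gaussian_tail:.4f}, observed: {observed:.4f}")
```

For this particular distribution the observed rate comes out higher than the Gaussian figure. Note that this only measures how often individual samples exceed the cutoff; it says nothing yet about how those exceedances cluster in time.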

Choosing for: with criteria

The for: clause is half of every threshold decision, and it's usually copy-pasted as 5m without thought. Two principles:

Also: alert on rates and ratios over windows (rate(errors[5m]) / rate(requests[5m])), not raw counters or instant gauges; and beware aggregation that hides variance — an average across 40 pods can look serene while two pods burn.
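
The effect of for: on noise can be explored offline by replaying a series through a simplified model of the pending-to-firing transition. The sketch below assumes one sample per rule evaluation and no missed evaluations; the function name and the series are hypothetical, and real Prometheus behavior has details this model leaves out.

```python
def count_firings(values, threshold, for_seconds, eval_interval_seconds):
    """Count how often a rule would go from pending to firing.

    Simplified model of `for:` (assumption: one sample per evaluation,
    no missing evaluations): the alert fires once the condition has held
    at every evaluation for at least `for_seconds`.
    """
    firings = 0
    held_seconds = None  # None means the condition is currently false
    fired = False
    for value in values:
        if value > threshold:
            if held_seconds is None:
                held_seconds = 0  # condition just became true: pending
            else:
                held_seconds += eval_interval_seconds
            if not fired and held_seconds >= for_seconds:
                firings += 1
                fired = True
        else:
            held_seconds = None
            fired = False
    return firings


# Hypothetical error ratio at 1-minute evaluations:
# a 3-minute blip followed later by an 8-minute excursion.
series = [0.1] * 10 + [0.9] * 3 + [0.1] * 10 + [0.9] * 8 + [0.1] * 10
for duration in (0, 300, 600):
    print(duration, count_firings(series, 0.8, duration, 60))
```

Running the same series through several candidate durations shows which excursions each one filters out, which is the information you need to weigh noise against detection delay.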

The concept that fixes tuning: a false-positive budget per rule

Here's the reframe that turns threshold debates into engineering decisions. For each rule, decide — as a team, on purpose — how many false pages per year this rule is allowed to cost you.

A page-severity rule that wakes a human might get a budget of 2–4/year. A business-hours ticket rule might get 20. Multiply across your rule set and you get the total noise load your on-call has agreed to carry: say, 50 rules × an average budget of 4 = 200 expected false pages a year, about one every other day. If that total horrifies you (it should), you now have a principled way to spend attention: cut budgets on the rules that earn it least.
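
The arithmetic behind that total is simple enough to keep in a script next to your rule files. A minimal sketch, using the 50-rule example from the paragraph above; the function name is hypothetical.

```python
def noise_load(budgets_per_year):
    """Return (total false pages per year, average days between them)."""
    total = sum(budgets_per_year)
    days_between = 365 / total if total else float("inf")
    return total, days_between


# 50 rules with an average false-positive budget of 4 per year.
budgets = [4] * 50
total, days_between = noise_load(budgets)
print(f"{total} expected false pages/year, one every {days_between:.1f} days")
```

Feeding in the real per-rule budgets instead of an average makes it obvious which rules dominate the total.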

The budget does two things no amount of threshold-squinting does:

Computing that answer honestly is the hard part: you need the expected rate of sustained crossings of a candidate threshold under your metric's actual behavior, autocorrelation, seasonality and all. The defensible way to get it is resampling your own history with methods that preserve time dependence (block bootstrap rather than point statistics), after excluding known incident windows so you're calibrating against the healthy regime, not against your last outage. That's a real statistics project, a few weeks of focused work to do well in-house, but even the budget framing alone, applied with rough estimates from your alert history (Prometheus's synthetic ALERTS series has the receipts, for as long as your retention keeps them), will improve most rule sets dramatically.
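
As a starting point, a moving-block bootstrap can be sketched in a few dozen lines. Everything here is an assumption for illustration: the function names, the block length, the percentile choice, and the simplification that for: is modeled as a minimum run of consecutive samples. A plain moving-block bootstrap preserves short-range dependence inside each block but not seasonality, and joins between blocks can create or break runs, so treat the output as a rough estimate.

```python
import random


def sustained_crossings(values, threshold, min_run):
    """Count runs of at least `min_run` consecutive samples above threshold."""
    count, run = 0, 0
    for v in values:
        run = run + 1 if v > threshold else 0
        if run == min_run:
            count += 1  # counted once, when the run first becomes long enough
    return count


def block_bootstrap_fp_per_year(healthy, threshold, min_run, block_len,
                                samples_per_year, n_resamples=500, seed=0):
    """Moving-block bootstrap estimate of sustained crossings per year.

    healthy: evenly spaced samples with incident windows already removed.
    Returns (median, 5th percentile, 95th percentile) of the yearly rates
    seen across the resampled histories.
    """
    n = len(healthy)
    if min_run < 1 or not 1 <= block_len <= n:
        raise ValueError("need min_run >= 1 and 1 <= block_len <= len(healthy)")
    rng = random.Random(seed)
    starts = range(n - block_len + 1)
    rates = []
    for _ in range(n_resamples):
        resample = []
        while len(resample) < n:
            s = rng.choice(starts)  # pick a random contiguous block
            resample.extend(healthy[s:s + block_len])
        resample = resample[:n]
        hits = sustained_crossings(resample, threshold, min_run)
        rates.append(hits * samples_per_year / n)
    rates.sort()

    def pick(q):
        return rates[min(len(rates) - 1, int(q * len(rates)))]

    return pick(0.5), pick(0.05), pick(0.95)
```

A call might look like block_bootstrap_fp_per_year(history, threshold=0.8, min_run=5, block_len=60, samples_per_year=525_600) for 1-minute samples, with block_len chosen to be longer than the correlation time you observe in the metric.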

A tuning loop you can run this quarter

  1. Pull 90 days of alert history. Rank rules by fires-per-month. For each of the top 10, label every firing: real incident, or noise?
  2. Assign each rule an FP budget (page-severity: low single digits per year).
  3. For rules over budget: check the metric's seasonality first (is the threshold only wrong at peak?); then re-derive the threshold against healthy history — excluding incident windows — and re-check the for: against the "how long can this persist harmlessly?" question.
  4. For every change, replay your known past incidents against the new threshold. If a real incident would no longer fire, you've found a real trade-off — decide it consciously.
  5. Re-rank quarterly. Thresholds decay as traffic grows and architecture changes; tuning is a loop, not a project.
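
Steps 1 to 3 of the loop above can be bootstrapped with a small script once the firings are labeled. This is a sketch under assumptions: the rule names, labels, and budgets are hypothetical, and the labeled firings are assumed to have been exported by hand from your alert history.

```python
from collections import Counter


def over_budget(firings, budgets, window_days=90):
    """Return rules whose annualized noise rate exceeds their FP budget.

    firings: iterable of (rule_name, label), label "incident" or "noise".
    budgets: dict of rule_name -> allowed false pages per year.
    Rules without a budget entry are treated as having a budget of 0.
    """
    noise = Counter(rule for rule, label in firings if label == "noise")
    scale = 365 / window_days  # extrapolate the window to a full year
    report = []
    for rule, count in noise.items():
        per_year = count * scale
        budget = budgets.get(rule, 0)
        if per_year > budget:
            report.append((rule, round(per_year, 1), budget))
    # Worst offenders first.
    return sorted(report, key=lambda row: row[1], reverse=True)


# Hypothetical 90 days of labeled firings.
labeled = [("HighLatency", "noise")] * 6 + [("HighLatency", "incident")]
labeled += [("DiskFull", "noise")]
result = over_budget(labeled, {"HighLatency": 4, "DiskFull": 20})
print(result)
```

The output is the work queue for step 3: each entry is a rule, its annualized noise rate, and the budget it is exceeding.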

If you'd rather not build the statistics yourself: PagerProof runs exactly this analysis as a fixed-price audit, with block-bootstrap under H0 of your healthy regime, a proven FP/year per rule with confidence intervals, and calibrated thresholds against your chosen budgets. The methodology is public so you can check the math before paying for the automation.