Prometheus alert threshold tuning: a practical guide

PagerProof blog · 2026 · ~9 min read

Most Prometheus thresholds were set once, by someone who has since left, using a number that felt right that afternoon. Here's how to choose thresholds and for: durations with actual criteria — and why the two most popular recipes quietly mislead you.

Start from what a threshold actually is

An alert rule is a decision procedure: given this metric's behavior, decide whether the system is broken. Every decision procedure has two error rates — false positives (paging on a healthy system) and false negatives (sleeping through a broken one). Threshold tuning is choosing where you sit on that trade-off.

The mistake almost every team makes is tuning without naming the trade-off. "Set it to 0.8" is not a decision; "we accept roughly five false pages per year from this rule in exchange for catching latency regressions within five minutes" is. You can't get to the second statement by staring at a Grafana panel and squinting. You need to ask, for any candidate threshold: how often will normal behavior cross it?
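That question can be made concrete with a few lines of code. The sketch below counts how many times a series stays above a candidate threshold for a minimum number of consecutive samples. The function name, the sample values and the `min_samples` parameter are illustrative assumptions, not part of Prometheus.

```python
from typing import Sequence


def count_sustained_crossings(
    values: Sequence[float], threshold: float, min_samples: int
) -> int:
    """Count runs of at least `min_samples` consecutive values above `threshold`."""
    episodes = 0
    run = 0
    for value in values:
        if value > threshold:
            run += 1
            if run == min_samples:  # count each qualifying run exactly once
                episodes += 1
        else:
            run = 0
    return episodes


# Hypothetical healthy-period samples, one per rule evaluation.
healthy = [0.2, 0.9, 0.3, 0.85, 0.9, 0.95, 0.4, 0.9, 0.9, 0.1]

print(count_sustained_crossings(healthy, threshold=0.8, min_samples=1))  # 3
print(count_sustained_crossings(healthy, threshold=0.8, min_samples=3))  # 1
```

Same data, same threshold: three crossings if a single sample is enough, one if the condition has to persist for three samples.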

Mistake #1: the 7-day p95 (or max) as threshold

A common recipe: query the last 7 days, take the p95 (or the max plus some headroom), make that the threshold. It feels empirical. Three problems:

Seven days is one sample of each weekday. Your Tuesday peak appeared exactly once in that window. If it was a quiet Tuesday, your threshold is too tight and next Tuesday pages you. If a deploy spike landed in the window, the threshold is inflated by an anomaly — calibrated to tolerate the very thing you wanted to catch.
A p95 of points says nothing about events. Prometheus pages on sustained crossings (via for:), not single samples. By construction, a series spends about 5% of its time above its own p95. Whether that 5% arrives as thousands of isolated blips or as three long excursions per week depends on how autocorrelated the series is. Same p95, wildly different page counts.
It has no notion of "healthy". Whatever happened that week, incidents included, is baked in as normal.
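To see the second problem in numbers, here is a small sketch with two synthetic series. Both spend exactly 5% of their samples above the threshold, but one does it in isolated blips and the other in long excursions. The series are constructed for illustration only, and the five-sample persistence requirement is an arbitrary stand-in for a for: clause.

```python
from typing import Sequence


def sustained_episodes(above: Sequence[bool], min_samples: int) -> int:
    """Count runs of at least `min_samples` consecutive True values."""
    episodes = 0
    run = 0
    for flag in above:
        run = run + 1 if flag else 0
        if run == min_samples:  # count each qualifying run exactly once
            episodes += 1
    return episodes


n = 10_000
# Series A: every 20th sample is above the threshold (isolated blips).
blips = [i % 20 == 0 for i in range(n)]
# Series B: one 100-sample excursion every 2000 samples.
excursions = [i % 2000 < 100 for i in range(n)]

print(sum(blips) / n, sum(excursions) / n)  # 0.05 0.05
print(sustained_episodes(blips, 5))         # 0
print(sustained_episodes(excursions, 5))    # 5
```

The point statistic is identical for both series; the number of alerts a persistence requirement lets through is not.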

Mistake #2: mean+2σ (the Gaussian folklore)

The other classic: compute mean and standard deviation, set the threshold at μ+2σ (or 3σ), expect the textbook one-sided exceedance of about 2.3% (or 0.13%). This imports three assumptions from the textbook that production metrics violate:

Independence. Consecutive scrapes are heavily autocorrelated. Excursions cluster: once above the mean, the series tends to stay there. The rate of sustained crossings under autocorrelation can be an order of magnitude away from the i.i.d. prediction — and σ itself is a biased estimate when computed from dependent samples.
One distribution. Daily and weekly seasonality means your metric is a different distribution at Tuesday 10:00 than at Sunday 03:00. A global μ and σ average them into a threshold that's wrong at both times — too tight at peak, too loose at trough.
Symmetry-ish behavior. Latency and queue metrics are heavy-tailed. The Gaussian quantile math is the wrong table to look things up in.

None of this means "never use a static threshold". It means the confidence people attach to μ+2σ is unjustified. The number looks scientific; its error rate is unknown.
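The heavy-tail point can be checked in closed form. The sketch below assumes, purely for illustration, that the metric follows a log-normal distribution with shape parameter 1 (real latency data will differ), and compares the true fraction of samples above μ+kσ with the Gaussian prediction.

```python
import math
from statistics import NormalDist


def gaussian_exceedance(k: float) -> float:
    """P(X > mean + k*sd) if X really were Gaussian."""
    return 1 - NormalDist().cdf(k)


def lognormal_exceedance(k: float, shape: float = 1.0) -> float:
    """P(X > mean + k*sd) for X ~ LogNormal(0, shape), in closed form."""
    mean = math.exp(shape**2 / 2)
    sd = math.sqrt((math.exp(shape**2) - 1) * math.exp(shape**2))
    threshold = mean + k * sd
    # X > threshold exactly when ln(X) > ln(threshold), and ln(X) ~ N(0, shape).
    return 1 - NormalDist().cdf(math.log(threshold) / shape)


for k in (2, 3):
    print(k, round(gaussian_exceedance(k), 4), round(lognormal_exceedance(k), 4))
```

Under this assumed distribution the μ+2σ threshold is exceeded roughly 3.7% of the time instead of 2.3%, and μ+3σ roughly 1.8% of the time instead of 0.13%. The exact gap depends on the tail you actually have, which is the point: the Gaussian table does not tell you.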

Choosing for: with criteria

The for: clause is half of every threshold decision, and it's usually copy-pasted as 5m without thought. Two principles:

for: trades detection latency for noise immunity — explicitly. Ask: what's the longest this condition could persist without harm? For a saturating disk, hours — use a long for: or a prediction. For user-facing error rate, maybe 2 minutes. Write the answer in a comment above the rule; future-you will thank you.
for: interacts with your scrape and evaluation intervals. With for: 2m and a 1m evaluation interval, the condition has to hold on three consecutive evaluations (at t, t+1m and t+2m) before the alert fires. One flappy evaluation resets the clock. If your metric oscillates around the threshold (very common with autocorrelated series sitting near it), you get a rule that's simultaneously noisy in pending and late to firing: the worst of both. The fix is usually moving the threshold, not stretching for: indefinitely. Past a point, a longer for: just converts false positives into detection delay on real incidents.
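The interaction between for: and the evaluation interval is easier to reason about as a tiny state machine. What follows is a simplified model of the pending-to-firing logic, not Prometheus's actual code: it assumes evaluations land exactly on the interval and ignores features such as keep_firing_for. The function name is hypothetical.

```python
from typing import Optional, Sequence


def first_firing_index(
    condition: Sequence[bool], for_seconds: int, interval_seconds: int
) -> Optional[int]:
    """Return the index of the evaluation at which the alert first fires, or None."""
    active_since: Optional[int] = None  # time the condition last became true
    for i, is_true in enumerate(condition):
        now = i * interval_seconds
        if not is_true:
            active_since = None  # one false evaluation resets the pending clock
            continue
        if active_since is None:
            active_since = now  # alert enters pending
        if now - active_since >= for_seconds:
            return i  # condition has held for the full for: duration
    return None


steady = [True, True, True]
flappy = [True, True, False, True, True, True]

print(first_firing_index(steady, for_seconds=120, interval_seconds=60))  # 2
print(first_firing_index(flappy, for_seconds=120, interval_seconds=60))  # 5
```

In this model a single false evaluation in the middle of an incident pushes firing from the third evaluation to the sixth, which is the "late to firing" half of the problem described above.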

Also: alert on rates and ratios over windows (rate(errors[5m]) / rate(requests[5m])), not raw counters or instant gauges; and beware aggregation that hides variance — an average across 40 pods can look serene while two pods burn.
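The aggregation warning is simple arithmetic. The numbers below are invented for illustration: 38 pods at a 0.1% error ratio, two pods at 20%, and a 5% alert threshold, with every pod assumed to carry equal traffic.

```python
from statistics import mean

# Hypothetical per-pod error ratios over the same window.
pod_error_ratio = [0.001] * 38 + [0.20] * 2
threshold = 0.05

fleet_average = mean(pod_error_ratio)
worst_pod = max(pod_error_ratio)
pods_over_threshold = sum(ratio > threshold for ratio in pod_error_ratio)

print(f"fleet average: {fleet_average:.4f}")  # fleet average: 0.0109
print(f"worst pod:     {worst_pod:.4f}")      # worst pod:     0.2000
print(f"pods over {threshold}: {pods_over_threshold}")
```

A rule on the average sees about 1.1% and stays quiet; a rule on the maximum, or on the count of pods over the threshold, sees the two burning pods.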

The concept that fixes tuning: a false-positive budget per rule

Here's the reframe that turns threshold debates into engineering decisions. For each rule, decide — as a team, on purpose — how many false pages per year this rule is allowed to cost you.

A page-severity rule that wakes a human might get a budget of 2–4/year. A business-hours ticket rule might get 20. Multiply across your rule set and you get the total noise load your on-call has agreed to carry: say, 50 rules × an average budget of 4 = 200 expected false pages a year, about one every other day. If that total horrifies you (it should), you now have a principled way to spend attention: cut budgets on the rules that earn it least.

The budget does two things no amount of threshold-squinting does:

It makes the trade-off explicit and owned. When the rule fires falsely twice in a month against a budget of 2/year, you don't argue about feelings — the rule is over budget, and it gets recalibrated or demoted.
It makes thresholds derived quantities. The question stops being "what threshold feels right?" and becomes "what's the lowest threshold whose expected false-page rate fits the budget?" — a question with a computable answer.
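As a rough first pass at that computable answer, you can scan candidate thresholds against healthy history and keep the lowest one whose annualized crossing rate fits the budget. This is a point estimate from a single history, with synthetic data and hypothetical function names; a zero count over a short window is weak evidence, and it is no substitute for the resampling approach.

```python
from typing import Optional, Sequence

SECONDS_PER_YEAR = 365 * 24 * 3600


def sustained_crossings(values: Sequence[float], threshold: float, min_samples: int) -> int:
    """Count runs of at least `min_samples` consecutive values above `threshold`."""
    episodes = run = 0
    for value in values:
        run = run + 1 if value > threshold else 0
        if run == min_samples:
            episodes += 1
    return episodes


def lowest_threshold_within_budget(
    healthy: Sequence[float],
    candidates: Sequence[float],
    min_samples: int,
    interval_seconds: int,
    budget_per_year: float,
) -> Optional[float]:
    """Lowest candidate whose observed crossing rate, scaled to a year, fits the budget."""
    observed_years = len(healthy) * interval_seconds / SECONDS_PER_YEAR
    for threshold in sorted(candidates):
        rate = sustained_crossings(healthy, threshold, min_samples) / observed_years
        if rate <= budget_per_year:
            return threshold
    return None  # no candidate fits; widen the candidate range or revisit the budget


# Synthetic 30 days of 1-minute samples with four short excursions.
healthy = [0.5] * (30 * 24 * 60)
for start, level in [(1_000, 0.7), (9_000, 0.7), (20_000, 0.9), (35_000, 0.7)]:
    healthy[start:start + 10] = [level] * 10

for budget in (60, 20, 4):
    print(budget, lowest_threshold_within_budget(healthy, [0.6, 0.8, 1.0], 5, 60, budget))
```

With this data, a budget of 60/year admits the 0.6 threshold, 20/year requires 0.8, and 4/year requires 1.0. Tightening the budget moves the threshold; nobody had to argue about what felt right.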

Computing that answer honestly is the hard part: you need the expected rate of sustained crossings of a candidate threshold under your metric's actual behavior — autocorrelation, seasonality and all. The defensible way to get it is resampling your own history with methods that preserve time dependence (block bootstrap rather than point statistics), after excluding known incident windows so you're calibrating against the healthy regime, not against your last outage. That's a real statistics project — a few weeks of focused work to do well in-house — but even the budget framing alone, applied with rough estimates from your alert history (ALERTS has the receipts), will improve most rule sets dramatically.
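For a feel of what the resampling looks like, here is a minimal moving-block bootstrap sketch. It assumes the input series already has incident windows removed, uses a synthetic autocorrelated series as a stand-in for real history, and does nothing about seasonality, so treat it as a starting point rather than a finished method. All names and parameter values are illustrative.

```python
import random
from typing import List, Sequence, Tuple


def sustained_crossings(values: Sequence[float], threshold: float, min_samples: int) -> int:
    """Count runs of at least `min_samples` consecutive values above `threshold`."""
    episodes = run = 0
    for value in values:
        run = run + 1 if value > threshold else 0
        if run == min_samples:
            episodes += 1
    return episodes


def moving_block_resample(series: Sequence[float], block_len: int, rng: random.Random) -> List[float]:
    """Rebuild a series of the same length from randomly chosen contiguous blocks."""
    n = len(series)
    resampled: List[float] = []
    while len(resampled) < n:
        start = rng.randrange(n - block_len + 1)
        resampled.extend(series[start:start + block_len])
    return resampled[:n]


def bootstrap_crossings(healthy: Sequence[float], threshold: float, min_samples: int,
                        block_len: int, n_resamples: int = 200, seed: int = 0) -> List[int]:
    """Sustained-crossing count for each block-bootstrap resample."""
    rng = random.Random(seed)
    return [
        sustained_crossings(moving_block_resample(healthy, block_len, rng), threshold, min_samples)
        for _ in range(n_resamples)
    ]


def percentile_interval(counts: Sequence[int], lower: float = 0.025, upper: float = 0.975) -> Tuple[int, int]:
    """Simple percentile interval over the bootstrap counts."""
    ordered = sorted(counts)
    last = len(ordered) - 1
    return ordered[round(lower * last)], ordered[round(upper * last)]


# Synthetic autocorrelated stand-in for healthy history (AR(1)-style).
rng = random.Random(42)
level, healthy = 0.5, []
for _ in range(5_000):
    level = 0.5 + 0.9 * (level - 0.5) + rng.gauss(0, 0.05)
    healthy.append(level)

counts = bootstrap_crossings(healthy, threshold=0.7, min_samples=3, block_len=120)
print(percentile_interval(counts))
```

Dividing the counts by the years of history covered turns the interval into false pages per year. The block length matters: it should be long enough to contain a typical excursion, otherwise the resampling breaks up the very runs you are trying to count.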

A tuning loop you can run this quarter

Pull 90 days of alert history. Rank rules by fires-per-month. For each of the top 10, label every firing: real incident, or noise?
Assign each rule an FP budget (page-severity: low single digits per year).
For rules over budget: check the metric's seasonality first (is the threshold only wrong at peak?); then re-derive the threshold against healthy history — excluding incident windows — and re-check the for: against the "how long can this persist harmlessly?" question.
For every change, replay your known past incidents against the new threshold. If a real incident would no longer fire, you've found a real trade-off — decide it consciously.
Re-rank quarterly. Thresholds decay as traffic grows and architecture changes; tuning is a loop, not a project.
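The bookkeeping for the first two steps fits in a short script. The sketch below assumes you have already labelled each firing in a 90-day window as real or noise; the rule names, counts and budgets are made up, and the `RuleHistory` type is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RuleHistory:
    name: str
    real_fires: int            # firings labelled as real incidents in the window
    noise_fires: int           # firings labelled as noise in the window
    fp_budget_per_year: float  # false pages per year the team has agreed to accept


def over_budget(rules: List[RuleHistory], window_days: int = 90) -> List[Tuple[str, float, float]]:
    """Rules whose annualized noise exceeds budget, worst overshoot first."""
    scale = 365 / window_days
    flagged = [
        (rule.name, rule.noise_fires * scale, rule.fp_budget_per_year)
        for rule in rules
        if rule.noise_fires * scale > rule.fp_budget_per_year
    ]
    return sorted(flagged, key=lambda item: item[1] - item[2], reverse=True)


# Hypothetical labelled history for four rules.
rules = [
    RuleHistory("ErrorRate", real_fires=1, noise_fires=2, fp_budget_per_year=4),
    RuleHistory("HighLatency", real_fires=2, noise_fires=6, fp_budget_per_year=4),
    RuleHistory("QueueDepth", real_fires=0, noise_fires=3, fp_budget_per_year=20),
    RuleHistory("DiskFull", real_fires=1, noise_fires=0, fp_budget_per_year=2),
]

for name, annualized, budget in over_budget(rules):
    print(f"{name}: ~{annualized:.1f} false fires/year against a budget of {budget}")
```

Extrapolating 90 days to a year is crude, especially for rules that fire rarely, but it is enough to decide which rules to recalibrate first.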

If you'd rather not build the statistics yourself: PagerProof runs exactly this analysis as a fixed-price audit, with a block bootstrap under H0 of your healthy regime, an estimated FP/year per rule with confidence intervals, and calibrated thresholds against your chosen budgets. The methodology is public so you can check the math before paying for the automation. Details →
