How to reduce alert fatigue in Prometheus without losing real alerts

Alert fatigue isn't a morale problem with a tooling smell — it's a tooling problem with a morale cost. Here's how to recognize it, what it's actually costing you in hours, and a triage framework plus Alertmanager quick wins that don't require trusting anyone's "AI".

The symptoms (you probably have several)

The last two are the dangerous ones. The industry observation is blunt: after enough false alarms, the team starts ignoring the warnings (AssetWatch). At that point your alerting system still costs money and sleep — it just no longer provides detection. It is functionally dead, and the outage that proves it is already scheduled; you just don't know the date.

What it costs, in numbers you can show your manager

Industry figures, sources in brackets:

And the costs the spreadsheet doesn't capture: every false page erodes the credibility of the next real one, off-hours pages tax sleep and retention, and on-call dread is consistently one of the reasons platform engineers quit. If your team has had a departure where the pager was mentioned in the exit conversation, you already know this isn't hypothetical.

A triage framework: noisy / ok / insensitive

You can't fix a 300-rule set rule-by-rule on vibes. Triage first. Pull 90 days of alert history (the ALERTS metric in Prometheus, or your Alertmanager/paging-tool logs) and put every rule that fired into one of three buckets:

Bucket: Noisy
Definition: Fires regularly; most firings led to no action. The honest test: "when this fired, did anyone do anything?"
Action: Recalibrate the threshold/for: against healthy history, demote from page to ticket, or delete. These rules are spending your team's attention budget and buying nothing.

Bucket: OK
Definition: Fires rarely; when it fires, action follows. The rule earns its keep.
Action: Leave it alone. Document why it's good so it survives the next re-org.

Bucket: Insensitive
Definition: Never fires, including during incidents it should have caught. Found by replaying past incidents: "which rules should have fired here and didn't?"
Action: The scary bucket everyone forgets. Lower the threshold or fix the expression: these are the silent false negatives that noisy rule sets hide.
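
As a rough illustration, the three buckets can be expressed as a small classifier. This is a minimal sketch, not a prescribed method: the RuleHistory record, its field names, and the 0.5 cut-off for "most firings led to no action" are assumptions made for the example, and the counts are assumed to come from your own export of alert history and incident reviews.

```python
from dataclasses import dataclass


@dataclass
class RuleHistory:
    """Per-rule summary over the review window (hypothetical record)."""
    name: str
    firings: int           # times the rule fired in the window
    actioned: int          # firings after which someone actually did something
    missed_incidents: int  # past incidents it should have caught but did not


def triage(rule: RuleHistory, min_action_ratio: float = 0.5) -> str:
    """Return 'insensitive', 'noisy', or 'ok' for one rule."""
    # Blindness is checked first: a missed incident outranks any noise finding.
    if rule.missed_incidents > 0:
        return "insensitive"
    # Assumption: a silent rule that missed nothing has no evidence against it.
    if rule.firings == 0:
        return "ok"
    # "Most firings led to no action" is read here as an action ratio below 0.5.
    if rule.actioned / rule.firings < min_action_ratio:
        return "noisy"
    return "ok"


if __name__ == "__main__":
    for r in [
        RuleHistory("HighCPU", firings=120, actioned=6, missed_incidents=0),
        RuleHistory("DiskFull", firings=3, actioned=3, missed_incidents=0),
        RuleHistory("QueueStalled", firings=0, actioned=0, missed_incidents=2),
    ]:
        print(r.name, triage(r))
```

The rule names are made up. The useful part is the ordering of the checks: missed incidents are tested before noise, so a tuning pass cannot quietly hide a blind rule behind a "fewer pages" metric.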

Two notes. First, the "insensitive" bucket is why pure alert-reduction crusades are dangerous: if your only metric is "fewer pages", you'll happily make rules blind. Audit both directions — replay your last five incidents against the rule set every time you tune. Second, expect a power-law: in most setups, 5–10 rules generate 60–80% of the noise. Triage means you fix those first and ignore the long tail until next quarter.
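
To find the handful of rules behind most of the noise, one option is to rank rules by unactioned firings and cut the list at a cumulative share. The sketch below assumes you already have a count of unactioned firings per rule; the function name and the 0.8 target share are illustrative choices, not fixed values.

```python
def top_noise_sources(
    unactioned_by_rule: dict[str, int], target_share: float = 0.8
) -> list[tuple[str, float]]:
    """Return the noisiest rules, with cumulative share, up to target_share."""
    total = sum(unactioned_by_rule.values())
    if total == 0:
        return []
    # Noisiest first.
    ranked = sorted(unactioned_by_rule.items(), key=lambda kv: kv[1], reverse=True)
    selected = []
    cumulative = 0
    for name, count in ranked:
        cumulative += count
        share = cumulative / total
        selected.append((name, share))
        # Stop once the selected rules cover the target share of the noise.
        if share >= target_share:
            break
    return selected


if __name__ == "__main__":
    counts = {"HighCPU": 60, "PodRestart": 25, "DiskLatency": 10, "CertExpiry": 5}
    for name, share in top_noise_sources(counts):
        print(f"{name}: cumulative {share:.0%}")
```

The returned list is the work queue for the first tuning pass; everything after the cut-off is the long tail that can wait.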

Quick wins in Alertmanager (this week, config only)

These don't fix bad thresholds, but they cut dispatched noise immediately:

The fix underneath: calibrate the thresholds

Everything above manages the noise after the rules fire. The root cause is that the thresholds fire wrongly in the first place: they were guessed, or set with point statistics (a 7-day p95, a mean+2σ) that don't model how the system actually behaves. Metrics are autocorrelated and seasonal, and rules page on sustained crossings rather than single samples. We covered the mechanics, and the false-positive-budget-per-rule concept that makes tuning decisions defensible, in the threshold-tuning guide.

The short version: for each rule, decide how many false pages per year it's allowed to cost, then derive the lowest threshold whose expected false-page rate — estimated against your own healthy history, with time dependence preserved — fits that budget. Do that, and the Alertmanager layer above goes back to being polish on a sound foundation instead of a tourniquet on a guessed one. The 20–40% → 5–15% false-positive improvement cited at the top is precisely the gap between guessed thresholds and calibrated ones.
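
The budget idea can be sketched in a few lines. This is a deliberately simplified version: it counts sustained crossings directly on the ordered healthy history, which keeps the time dependence of that one series but is a plain empirical count, not a bootstrap calibration. The function names, the candidate list, and modelling the for: clause as a number of consecutive samples are all assumptions made for the example.

```python
from typing import Optional, Sequence


def count_sustained_crossings(
    samples: Sequence[float], threshold: float, for_samples: int
) -> int:
    """Count episodes where the series stays above threshold for at least
    for_samples consecutive points (a simplified stand-in for `for:`)."""
    episodes = 0
    run = 0
    for value in samples:
        if value > threshold:
            run += 1
            if run == for_samples:
                episodes += 1  # counted once, when the episode would start paging
        else:
            run = 0
    return episodes


def lowest_threshold_within_budget(
    healthy: Sequence[float],
    candidates: Sequence[float],
    for_samples: int,
    samples_per_year: float,
    budget_per_year: float,
) -> Optional[tuple[float, float]]:
    """Return (threshold, estimated false pages per year) for the lowest
    candidate that fits the budget, or None if no candidate fits."""
    history_years = len(healthy) / samples_per_year
    # Every candidate is checked in ascending order, because the episode
    # count is not guaranteed to fall monotonically as the threshold rises.
    for threshold in sorted(candidates):
        pages = count_sustained_crossings(healthy, threshold, for_samples)
        rate = pages / history_years
        if rate <= budget_per_year:
            return threshold, rate
    return None


if __name__ == "__main__":
    healthy = [1, 1, 5, 5, 5, 1, 1, 6, 6, 1, 7, 7, 7, 7, 1, 1]
    print(lowest_threshold_within_budget(
        healthy, candidates=[4, 5.5, 6.5], for_samples=2,
        samples_per_year=len(healthy), budget_per_year=1,
    ))
```

With only one pass over a short history, the estimate is noisy. That is the gap a resampling method such as a block bootstrap is meant to close, and it is outside the scope of this sketch.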

Sequence that works in practice:

  1. Week 1: triage (noisy/ok/insensitive) from 90 days of history. Grouping + inhibition pass in Alertmanager.
  2. Week 2–3: recalibrate the top 5–10 noisy rules against healthy history; replay past incidents to verify detection; demote what shouldn't page.
  3. Quarterly: re-triage. Thresholds decay as systems change — fatigue creeps back if tuning isn't a loop.
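
The "replay past incidents" step in the sequence above can be approximated offline before any rule change ships. The sketch below assumes you have the metric samples for each past incident window. The incident names are placeholders, and the for: clause is again simplified to a count of consecutive samples.

```python
from typing import Mapping, Sequence


def fires_during(
    samples: Sequence[float], threshold: float, for_samples: int
) -> bool:
    """True if the series stays above threshold for for_samples points in a row."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= for_samples:
            return True
    return False


def missed_incidents(
    incidents: Mapping[str, Sequence[float]], threshold: float, for_samples: int
) -> list[str]:
    """Names of past incidents the candidate rule would not have paged on."""
    return [
        name
        for name, samples in incidents.items()
        if not fires_during(samples, threshold, for_samples)
    ]


if __name__ == "__main__":
    past = {
        "incident_a": [1, 8, 9, 9, 2],  # sustained breach
        "incident_b": [1, 8, 2, 8, 2],  # spikes only, never sustained
    }
    print(missed_incidents(past, threshold=7, for_samples=3))
```

Any incident name this returns is a reason to reject the new threshold or shorten the sustained-crossing window before the change goes live.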

If you want the calibration done with actual statistics: PagerProof audits your rule set with block-bootstrap calibration under the null hypothesis (H0) and returns every threshold with its proven FP/year, including the noisy/ok/insensitive verdict per rule, and a money-back guarantee if we can't show a ≥30% projected FP reduction while maintaining detection. The methodology is public.