The PagerProof Methodology
Everything below is public so you can audit it, replicate it, or hire any statistician to check it. What's proprietary is the engine that automates it across hundreds of rules — not the science. If you find an error here, we want to know: hello@pagerproof.com.
1. What H0 means here
Every alert threshold is implicitly a statistical test. The null hypothesis, H0, is: "the system is in its healthy regime." A page is a rejection of H0. A false positive is a rejection while H0 was actually true — the system was fine, and someone got woken up anyway.
Framed this way, the question "where should my threshold be?" becomes a question with a precise answer: given the statistical behavior of this metric while the system is healthy, how often will this rule fire by chance alone? That quantity — expected false positives per year under H0 — is what PagerProof computes for every rule. It is the number nobody in the alerting market gives you, because computing it honestly requires modeling your series' actual dependence structure, not assuming one.
Two things follow from this framing, and both matter:
- The FP/year figure is always conditional: it holds under H0 of the regime we validated against. If your traffic patterns change structurally (a migration, a 10× growth quarter), the condition changes and the calibration should be redone. We say this in every report rather than hiding it in fine print.
- The framing says nothing directly about detection of real incidents (the alternative hypothesis). Anomalies are, by definition, not the healthy regime, so a threshold calibrated under H0 doesn't suppress them — but quantifying recall rigorously requires labeled incidents. See Limitations; we won't pretend otherwise.
2. Why static thresholds (and mean+2σ) fail
The classical recipe (take the mean and standard deviation of the metric, set the threshold at μ+2σ, expect ~2.3% of points above it, which is the upper-tail probability of a Gaussian at 2σ) rests on assumptions that metric time series violate systematically:
2.1 Autocorrelation
Consecutive samples of latency, CPU, queue depth, or error rate are not independent draws. The value at minute t strongly predicts the value at minute t+1. Positive autocorrelation makes excursions cluster: when the series wanders above its mean, it tends to stay there for a while. The practical effect is that the rate of threshold-crossing events can differ from the i.i.d. prediction by an order of magnitude. The estimate of σ is itself biased when computed from dependent samples: under positive autocorrelation, the sample variance tends to understate the true variance. The 2.3% in the textbook is a statement about a process your infrastructure does not run.
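A minimal simulation sketch of the clustering effect, using only the Python standard library. The AR(1) process with coefficient 0.9 is an assumed stand-in for an autocorrelated metric, not a model of any real series, and `ar1` and `excursion_lengths` are hypothetical helper names. Both series have the same N(0, 1) marginal distribution, so any difference in excursion behavior comes from dependence alone.

```python
import random
from statistics import NormalDist

def ar1(n, phi, rng):
    """AR(1) series scaled so that its marginal distribution is N(0, 1)."""
    innov_sd = (1 - phi ** 2) ** 0.5
    x = rng.gauss(0, 1)
    out = []
    for _ in range(n):
        out.append(x)
        x = phi * x + rng.gauss(0, innov_sd)
    return out

def excursion_lengths(series, threshold):
    """Length of every maximal run of consecutive samples above the threshold."""
    runs, run = [], 0
    for x in series:
        if x > threshold:
            run += 1
        elif run:
            runs.append(run)
            run = 0
    if run:
        runs.append(run)
    return runs

rng = random.Random(42)
n = 50_000
iid = [rng.gauss(0, 1) for _ in range(n)]
dependent = ar1(n, phi=0.9, rng=rng)  # phi=0.9 is an illustrative assumption

threshold = 2.0  # mu + 2*sigma for a N(0, 1) marginal
textbook_tail = 1 - NormalDist().cdf(threshold)  # roughly 0.0228

iid_runs = excursion_lengths(iid, threshold)
ar_runs = excursion_lengths(dependent, threshold)
for name, runs in (("i.i.d.", iid_runs), ("AR(1)", ar_runs)):
    print(name, "excursions:", len(runs),
          "mean length:", round(sum(runs) / len(runs), 2),
          "longest:", max(runs))
```

In this sketch the dependent series produces fewer separate excursions than the independent one, but each excursion lasts longer, which is the clustering the pointwise 2.3% figure cannot see.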
2.2 Seasonality
Most production metrics have daily and weekly cycles. A single global mean and σ average over Tuesday-10:00 traffic and Sunday-03:00 traffic as if they came from the same distribution. The result is a threshold that is simultaneously too tight for your peak (false positives every Tuesday) and too loose for your trough (a real Sunday-night incident sails under it). Any calibration that ignores the cycle is wrong in both directions at once.
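The two-sided failure can be illustrated with a toy daily profile. Everything in this sketch is assumed for illustration: the three-hour peak window, the levels of 100 and 200, the noise scale, and the 60% night-time incident. It computes one global mean+2σ threshold over four weeks of healthy hourly samples and checks what that threshold does at the peak and at the trough.

```python
import random
from statistics import mean, pstdev

rng = random.Random(7)
PEAK_HOURS = {9, 10, 11}  # assumed daily peak window

def healthy_level(hour):
    """Assumed healthy baseline: 200 during the peak, 100 otherwise."""
    return 200.0 if hour in PEAK_HOURS else 100.0

# 28 days of hourly samples from a healthy system with a daily peak.
samples = [(h, healthy_level(h) + rng.gauss(0, 5))
           for _ in range(28) for h in range(24)]
values = [v for _, v in samples]

# One global threshold, ignoring the daily cycle.
threshold = mean(values) + 2 * pstdev(values)

peak = [v for h, v in samples if h in PEAK_HOURS]
off_peak = [v for h, v in samples if h not in PEAK_HOURS]
peak_rate = sum(v > threshold for v in peak) / len(peak)
off_peak_rate = sum(v > threshold for v in off_peak) / len(off_peak)

# A night-time incident lifting the metric 60% above its usual trough level.
incident_level = 1.6 * healthy_level(3)

print(f"threshold: {threshold:.1f}")
print(f"healthy peak samples above threshold: {peak_rate:.0%}")
print(f"healthy off-peak samples above threshold: {off_peak_rate:.0%}")
print("night incident crosses threshold:", incident_level > threshold)
```

With these assumed numbers the healthy peak sits above the global threshold while the night-time incident stays below it, so the same threshold is too tight and too loose at once.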
2.3 A sustained crossing is an event, not a point
Prometheus rules with a for: clause don't page on a single sample: the clause requires the condition to hold at every evaluation for the whole duration. So the relevant probability is not "P(one point > threshold)" but "P(the series stays above the threshold for ≥ for: minutes)", the probability of an excursion of a given length. That probability depends intimately on the autocorrelation structure: a strongly autocorrelated series produces long excursions far more often than an independent one with identical marginal distribution. Pointwise math, however carefully done, cannot answer the question the alert rule is actually asking.
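A hedged sketch of replaying a sustained-crossing rule against two series with identical marginals. It simplifies the for: clause to "the condition holds for `for_samples` consecutive evaluations", which is an assumption of this sketch and not an exact reproduction of Prometheus timing. `count_firings` and `ar1` are hypothetical helper names, and the AR(1) coefficient of 0.9 is illustrative.

```python
import random

def count_firings(series, threshold, for_samples):
    """Count excursions that stay above the threshold for at least
    for_samples consecutive samples. Each excursion fires once."""
    firings, run = 0, 0
    for x in series:
        run = run + 1 if x > threshold else 0
        if run == for_samples:
            firings += 1
    return firings

def ar1(n, phi, rng):
    """AR(1) series scaled so that its marginal distribution is N(0, 1)."""
    innov_sd = (1 - phi ** 2) ** 0.5
    x = rng.gauss(0, 1)
    out = []
    for _ in range(n):
        out.append(x)
        x = phi * x + rng.gauss(0, innov_sd)
    return out

rng = random.Random(1)
n = 50_000
iid = [rng.gauss(0, 1) for _ in range(n)]
dependent = ar1(n, phi=0.9, rng=rng)

threshold, for_samples = 2.0, 5  # mu + 2*sigma, held for 5 evaluations
iid_firings = count_firings(iid, threshold, for_samples)
ar_firings = count_firings(dependent, threshold, for_samples)
print("i.i.d. firings:", iid_firings)
print("AR(1) firings:", ar_firings)
```

Both series spend about the same fraction of time above the threshold, yet only the dependent one is likely to hold there long enough to fire, which is why the excursion probability has to be estimated from the dependence structure and not from the marginal tail.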
This is the one-line summary of our pitch, and it's literally true: your mean+2σ doesn't know your generating process. Our block bootstrap does — because it never assumes a process at all. It resamples yours.
3. The moving-block bootstrap, explained honestly
The bootstrap is a standard technique for estimating the sampling distribution of a statistic when you can't (or won't) assume a parametric model: resample your own data many times, recompute the statistic each time, read the distribution off the resamples. The naive version resamples individual points — which destroys exactly the time dependence we just argued is essential.
The moving-block bootstrap (Künsch 1989; Liu & Singh 1992) fixes this by resampling contiguous blocks of the series instead of points. Within each block, the autocorrelation, the local trends, and the excursion shapes are preserved verbatim, because the block is a literal slice of your history. The procedure, concretely:
- Take the cleaned healthy-regime history of the metric (see §4 for "cleaned").
- Choose a block length ℓ. We require ℓ ≥ 4× the rule's for: duration, and longer when the empirical autocorrelation decays slowly. Rationale: the statistic we care about is the rate of sustained excursions of length for:; blocks must be long enough that such excursions live inside blocks rather than being chopped at the seams, otherwise the resampling artificially fragments long excursions and underestimates the FP rate, which is the dangerous direction.
- Build thousands of synthetic series by concatenating randomly drawn (overlapping) blocks until each reaches the original length.
- Replay the alert rule (same PromQL aggregation, same threshold, same for:) against every synthetic series, and count the firings.
- The distribution of firing counts across resamples, scaled to a year, is the estimate of FP/year under H0, with percentile confidence intervals read directly from the resample distribution.
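The procedure above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the PagerProof engine: the rule is reduced to "value above threshold for `for_samples` consecutive samples", the function names are hypothetical, the 95% percentile interval is one possible choice, and the synthetic AR(1) history stands in for a cleaned healthy-regime series.

```python
import random

def count_firings(series, threshold, for_samples):
    """Count excursions lasting at least for_samples consecutive samples."""
    firings, run = 0, 0
    for x in series:
        run = run + 1 if x > threshold else 0
        if run == for_samples:
            firings += 1
    return firings

def moving_block_resample(series, block_len, rng):
    """Concatenate randomly drawn contiguous blocks up to the original length."""
    n = len(series)
    out = []
    while len(out) < n:
        start = rng.randrange(n - block_len + 1)  # blocks may overlap in the history
        out.extend(series[start:start + block_len])
    return out[:n]

def fp_per_year(series, threshold, for_samples, block_len,
                samples_per_year, n_resamples=1000, seed=0):
    """Median FP/year and a 95% percentile interval from the resamples."""
    if block_len < 4 * for_samples:
        raise ValueError("block length must be at least 4x the for: duration")
    if block_len > len(series):
        raise ValueError("block length exceeds the history length")
    rng = random.Random(seed)
    scale = samples_per_year / len(series)
    rates = sorted(
        count_firings(moving_block_resample(series, block_len, rng),
                      threshold, for_samples) * scale
        for _ in range(n_resamples)
    )
    def pct(q):
        return rates[min(len(rates) - 1, int(q * len(rates)))]
    return pct(0.5), (pct(0.025), pct(0.975))

# Demo on synthetic data (a stand-in for a cleaned healthy-regime history).
demo_rng = random.Random(3)
x, history = 0.0, []
for _ in range(3000):
    x = 0.9 * x + demo_rng.gauss(0, 1)
    history.append(x)

median_fp, (low, high) = fp_per_year(
    history, threshold=4.0, for_samples=5, block_len=60,
    samples_per_year=525_600, n_resamples=200)
print(f"FP/year: {median_fp:.0f} (95% interval {low:.0f} to {high:.0f})")
```

Note how coarse the yearly scaling is for a history this short: a single firing in 3000 one-minute samples already scales to roughly 175 per year, which is one reason short histories yield wide intervals.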
What this buys you: an FP/year estimate that respects your series' actual dependence structure, with no Gaussian assumption, no stationarity-within-the-block assumption beyond what your own data exhibits, and an honest interval instead of a point estimate.
What it does not buy you, stated plainly: blocks are stitched at random seams, so dependence at lags longer than the block length is not preserved — which is why we size blocks against both the for: duration and the empirical autocorrelation decay, and why slow seasonal structure is handled separately (seasonal stratification of blocks: blocks are resampled within comparable time-of-day/day-of-week strata, so a Sunday-night block never stands in for a Tuesday peak). And like every resampling method, it can only show you behaviors your history contains: a regime your system has never visited is invisible to it. That last point is a fundamental limit of any data-driven method, ours included — see Limitations.
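One way to sketch the seasonal stratification described above is to define a stratum as the phase within a fixed cycle, so that every block is replaced by a block starting at the same phase in some other cycle. That definition, the fixed `period`, and the function name `seasonal_block_resample` are assumptions of this sketch; comparable time-of-day/day-of-week strata could be defined in other ways.

```python
import random

def seasonal_block_resample(series, block_len, period, rng):
    """Block resampling where each block is drawn from history positions
    that share the same phase (position modulo period) as the slot it fills."""
    n = len(series)
    if block_len > n - period + 1:
        raise ValueError("history too short for this block length and period")
    out = []
    while len(out) < n:
        phase = len(out) % period
        # Candidate starts: every history position at this phase that still
        # leaves room for a full block.
        starts = range(phase, n - block_len + 1, period)
        start = rng.choice(starts)
        out.extend(series[start:start + block_len])
    return out[:n]

# Demo: 14 daily cycles of hourly samples; the value encodes its own index
# so the source of every resampled point can be traced.
period, n = 24, 24 * 14
history = list(range(n))
resampled = seasonal_block_resample(history, block_len=12, period=period,
                                    rng=random.Random(0))

# Every resampled point comes from the same hour of day as the slot it fills.
same_phase = all((src - slot) % period == 0
                 for slot, src in enumerate(resampled))
print("phase preserved:", same_phase)
```

For weekly structure the same sketch applies with `period` set to the number of samples in a week, at the cost of fewer candidate blocks per stratum.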
4. Excluding anomalous windows from the baseline
This is the step most home-grown calibrations skip, and it's the expensive mistake. The FP/year figure is defined under H0 — the healthy regime. If your 90-day history contains three incidents and a deploy gone wrong, and you bootstrap over all of it, you are calibrating against a baseline contaminated with the very anomalies the rule exists to catch.
Contamination biases the calibration in the worst possible way: incident windows inflate the apparent spread of "normal" behavior, the calibrated threshold moves up to accommodate them, and the rule becomes quietly insensitive to exactly the events it was written for. The threshold looks statistically justified and is operationally blind. You traded false positives for false negatives without anyone deciding to.
So before any resampling, we identify and exclude anomalous windows from the baseline: known incidents from your records where available, plus windows flagged by robust outlier screening (methods based on medians and robust scale estimates, which a handful of extreme windows cannot drag the way they drag a mean and σ). Two honest notes on this step:
- Exclusion is itself a judgment call with a knob. We report exactly which windows were excluded and why, in The Proof, so the call is reviewable — not silently baked into the number.
- There is a circularity risk if exclusion is too aggressive (everything mildly unusual gets labeled "anomaly", the baseline becomes implausibly calm, thresholds get too tight). We keep the exclusion criteria conservative and documented, and the sensitivity analysis in §5 acts as the cross-check.
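A minimal sketch of median-based screening, as one possible reading of "robust outlier screening". The choices here are assumptions, not the documented criteria: windows are fixed-length and non-overlapping, each window is summarized by its median, the robust scale is the median absolute deviation (MAD) scaled by 1.4826, and the cutoff `k=3.5` is a conventional default. `flag_anomalous_windows` is a hypothetical name.

```python
from statistics import median

def flag_anomalous_windows(series, window_len, k=3.5):
    """Return indices of windows whose median sits more than k robust
    scale units away from the median of all window medians."""
    windows = [series[i:i + window_len]
               for i in range(0, len(series) - window_len + 1, window_len)]
    levels = [median(w) for w in windows]
    center = median(levels)
    mad = median(abs(v - center) for v in levels)
    scale = 1.4826 * mad  # comparable to a standard deviation for Gaussian data
    if scale == 0:
        return []  # degenerate case: no spread to compare against
    return [i for i, v in enumerate(levels) if abs(v - center) / scale > k]

# Demo: 20 windows of 10 samples with mild level variation; window 7
# carries an incident that lifts the metric by 50.
demo = []
for i in range(20):
    level = 100 + (i % 5) - 2
    if i == 7:
        level += 50
    demo.extend([level] * 10)

flagged = flag_anomalous_windows(demo, window_len=10)
print("windows excluded from the baseline:", flagged)
```

Because the center and scale are medians, the single incident window does not widen the scale it is judged against, which is the property a mean and σ lack.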
5. Calibration to a false-positive budget
With the machinery above, calibration inverts the question. Instead of "what FP rate does threshold X give?" we ask: "you've decided this rule deserves at most N false pages per year — what's the lowest threshold that fits in that budget?" The budget is an engineering decision (a page-severity rule might deserve 2/year; a ticket-severity one might tolerate 20); the threshold becomes a derived quantity instead of a guessed one.
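The inversion can be sketched as a scan over candidate thresholds. Assumptions of this sketch: the budget is checked against the upper end of a 95% percentile interval (a conservative choice made here for illustration), candidates come from a fixed grid, and the helper names are hypothetical. A scan is used instead of a bisection because the count of sustained firings is not guaranteed to fall monotonically as the threshold rises: a higher threshold can split one long excursion into two that each still qualify.

```python
import random

def count_firings(series, threshold, for_samples):
    """Count excursions lasting at least for_samples consecutive samples."""
    firings, run = 0, 0
    for x in series:
        run = run + 1 if x > threshold else 0
        if run == for_samples:
            firings += 1
    return firings

def resample(series, block_len, rng):
    """Moving-block resample up to the original length."""
    out = []
    while len(out) < len(series):
        start = rng.randrange(len(series) - block_len + 1)
        out.extend(series[start:start + block_len])
    return out[:len(series)]

def fp_upper_bound(history, threshold, for_samples, block_len,
                   samples_per_year, n_resamples=100, seed=0):
    """97.5th percentile of bootstrap FP/year for one threshold."""
    rng = random.Random(seed)  # same seed: every candidate sees the same blocks
    scale = samples_per_year / len(history)
    rates = sorted(
        count_firings(resample(history, block_len, rng), threshold, for_samples) * scale
        for _ in range(n_resamples)
    )
    return rates[min(len(rates) - 1, int(0.975 * len(rates)))]

def calibrate(history, candidates, budget_per_year, **kwargs):
    """Lowest candidate threshold whose upper bound fits the budget, or None."""
    for threshold in sorted(candidates):
        if fp_upper_bound(history, threshold, **kwargs) <= budget_per_year:
            return threshold
    return None

# Demo on a synthetic healthy-regime history (illustrative only).
demo_rng = random.Random(11)
x, history = 0.0, []
for _ in range(2000):
    x = 0.9 * x + demo_rng.gauss(0, 1)
    history.append(x)

candidates = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
settings = dict(for_samples=5, block_len=40, samples_per_year=525_600)
budget = 2.0  # false pages per year
calibrated = calibrate(history, candidates, budget, **settings)
print("calibrated threshold:", calibrated)
```

A return value of None corresponds to the case where no candidate fits the budget on the available history.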
For each rule, The Proof reports: the current threshold and its measured FP/year (most teams see their noisiest rules at 50–100+/year — that's the "verdict: noisy" line in the report); the calibrated threshold meeting the agreed budget, with its confidence interval; and a sensitivity analysis — we replay the historical incident windows (the ones excluded from the baseline in §4) against the new threshold and report which would still have fired, and with how much margin. Where we adjust for: durations as well as thresholds, both the old and new excursion statistics are shown.
Because an audit calibrates many rules at once, we account for the multiplicity: fifty rules each individually calibrated to a small FP rate still add up across the rule set, and the report presents both the per-rule and the aggregate expected page load, so the on-call budget is honest at the level your humans actually experience it — total pages, not per-rule fictions.
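The aggregate arithmetic is simple enough to show directly. The rule names and the uniform budget of 2 false pages per year are hypothetical. Expected counts add across rules whether or not the rules are correlated, so the aggregate expected load is the sum of the per-rule figures.

```python
# Hypothetical rule set: 50 rules, each calibrated to 2 false pages per year.
per_rule_fp_per_year = {f"rule_{i:02d}": 2.0 for i in range(50)}

aggregate_per_year = sum(per_rule_fp_per_year.values())
aggregate_per_week = aggregate_per_year / 52

print(f"{aggregate_per_year:.0f} false pages/year, "
      f"about {aggregate_per_week:.1f} per week")
# prints: 100 false pages/year, about 1.9 per week
```

A per-rule budget that sounds negligible still amounts to roughly two false pages a week for whoever is on call.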
Rules where the history cannot support a defensible number — too little data, a metric that changed identity mid-history, a regime shift we can't bridge — are marked "do not touch: insufficient data" with the reason. A calibration we can't defend with numbers is a guess wearing a lab coat, and we don't ship those.
6. Limitations — read this before buying
This section is prominent on purpose. Every method has limits; vendors who don't state theirs are asking you to discover them in production.
- Recall is not measurable without labeled incidents. FP/year under H0 is a number we can demonstrate. The complementary quantity — what fraction of real incidents the rule catches — requires a labeled incident history, which most teams don't have in statistically useful quantity. The sensitivity replay (§5) checks the new thresholds against the incidents you do have on record, but a handful of incidents is anecdote, not measurement. We will never quote you a "recall rate"; anyone who does, with the data you have, is making it up.
- Histories shorter than ~14 days give high uncertainty. The bootstrap reads its evidence from your history. With less than about two weeks of data, the confidence intervals on FP/year get wide enough to be honest but not very useful, and each weekly cycle has been observed at most twice. We'll still run the analysis, but those rules come back flagged with their wide intervals visible, and they often land in "do not touch" territory.
- Multi-series rules are analyzed as their max-envelope. A rule templated over hundreds of label combinations (per-pod, per-instance) is analyzed against the maximum envelope across series — the worst case that would trigger the alert. This is conservative and tractable, but it means per-series nuance (one specifically pathological pod) is summarized, not individually modeled. For rules where per-series behavior diverges wildly, we say so in the report.
- Calibrated thresholds require human review. The numbers tell you what normal behavior looks like statistically. They do not know that the CFO demo is on Thursday, that this service is being deprecated, or that a "statistically fine" latency level violates a contractual SLO. Every threshold in The Proof is a recommendation with its evidence attached — your engineers approve each one. We consider this a feature: an alerting change nobody reviewed is how this mess started.
- The estimate is conditional on the validated regime. Stated in §1, worth repeating: structural change in your traffic invalidates the condition under which FP/year was computed. For fast-moving systems that's the case for continuous recalibration (PagerProof Server); for stable systems, an audit's thresholds genuinely last — and we'll tell you which kind you are rather than sell you the subscription regardless.
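For the max-envelope limitation listed above, a minimal sketch of the reduction, assuming the per-label series are already aligned to the same timestamps. The pod names and the function name `max_envelope` are hypothetical.

```python
def max_envelope(series_by_label):
    """Pointwise maximum across aligned series: the worst case at each timestamp."""
    lengths = {len(s) for s in series_by_label.values()}
    if len(lengths) != 1:
        raise ValueError("series must be aligned to the same timestamps")
    return [max(values) for values in zip(*series_by_label.values())]

# Hypothetical per-pod latency series.
per_pod = {
    "pod-a": [1, 5, 2],
    "pod-b": [3, 4, 6],
}
envelope = max_envelope(per_pod)
print(envelope)  # [3, 5, 6]
```

Whenever any single series stays above a threshold, the envelope does too, so replaying a rule on the envelope fires at least as often as the per-series rule would. The converse does not hold: the envelope can stay high while different pods take turns at the top, which is the per-series nuance the summary loses.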
7. References & further reading
- Künsch, H. R. (1989). The Jackknife and the Bootstrap for General Stationary Observations. Annals of Statistics 17(3).
- Liu, R. Y. & Singh, K. (1992). Moving blocks jackknife and bootstrap capture weak dependence. In Exploring the Limits of Bootstrap.
- Lahiri, S. N. (2003). Resampling Methods for Dependent Data. Springer.
- Industry figures cited on the landing page: static-threshold FP share 20–40% vs. 5–15% calibrated (openobserve, AIOps guide); 45–90 min per dispatched false positive and up to 27 h/week for a team of three (reliamag/oxmaint); alert-fatigue habituation (AssetWatch).