The PagerProof Methodology

Everything below is public so you can audit it, replicate it, or hire any statistician to check it. What's proprietary is the engine that automates it across hundreds of rules — not the science. If you find an error here, we want to know: hello@pagerproof.com.

1. What H0 means here

Every alert threshold is implicitly a statistical test. The null hypothesis, H0, is: "the system is in its healthy regime." A page is a rejection of H0. A false positive is a rejection while H0 was actually true — the system was fine, and someone got woken up anyway.

Framed this way, the question "where should my threshold be?" becomes a question with a precise answer: given the statistical behavior of this metric while the system is healthy, how often will this rule fire by chance alone? That quantity — expected false positives per year under H0 — is what PagerProof computes for every rule. It is the number nobody in the alerting market gives you, because computing it honestly requires modeling your series' actual dependence structure, not assuming one.

Two things follow from this framing, and both matter:

2. Why static thresholds (and mean+2σ) fail

The classical recipe — take the mean and standard deviation of the metric, set the threshold at μ+2σ, expect ~2.3% of points above it — rests on assumptions that metric time series violate systematically:
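For reference, the textbook figure comes from the Gaussian upper tail. A minimal sketch, assuming independent, normally distributed samples and a one-minute sampling interval (the interval is our assumption for illustration):

```python
import math

def upper_tail(k: float) -> float:
    """P(Z > k) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(k / math.sqrt(2.0))

p = upper_tail(2.0)

# Assumption for illustration: one sample per minute, all year.
samples_per_year = 365 * 24 * 60
expected_points_above = p * samples_per_year

print(f"P(X > mu + 2*sigma) = {p:.4f}")
print(f"expected points above the threshold per year: {expected_points_above:.0f}")
```

Note that this counts individual points, not pages. The subsections below explain why neither the probability nor the unit carries over to real metric series.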

2.1 Autocorrelation

Consecutive samples of latency, CPU, queue depth, or error rate are not independent draws. The value at minute t strongly predicts the value at minute t+1. Positive autocorrelation makes excursions cluster: when the series wanders above its mean, it tends to stay there for a while. The practical effect is that the rate of threshold-crossing events can differ from the i.i.d. prediction by an order of magnitude, and the estimate of σ itself is biased (typically downward under positive autocorrelation) when computed from dependent samples. The 2.3% in the textbook is a statement about a process your infrastructure does not run.
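A small simulation can make the bias concrete. The sketch below uses a synthetic AR(1) series as a stand-in for a real metric. The coefficient 0.95, the 60-sample window, and the helper names are our assumptions for illustration, not values taken from any PagerProof audit.

```python
import math
import random
import statistics

def simulate_ar1(n, phi, rng):
    """AR(1) series with unit marginal variance: x[t] = phi*x[t-1] + sqrt(1-phi^2)*e[t]."""
    scale = math.sqrt(1.0 - phi * phi)
    x = rng.gauss(0.0, 1.0)
    out = []
    for _ in range(n):
        out.append(x)
        x = phi * x + scale * rng.gauss(0.0, 1.0)
    return out

def lag1_autocorr(series):
    """Sample lag-1 autocorrelation."""
    m = statistics.fmean(series)
    num = sum((a - m) * (b - m) for a, b in zip(series, series[1:]))
    den = sum((a - m) ** 2 for a in series)
    return num / den

rng = random.Random(0)
series = simulate_ar1(120_000, 0.95, rng)  # true variance is 1.0 by construction

# Sample variance computed on short, non-overlapping windows.
window = 60
window_vars = [
    statistics.variance(series[i:i + window])
    for i in range(0, len(series) - window + 1, window)
]

r1 = lag1_autocorr(series)
mean_window_var = statistics.fmean(window_vars)
print(f"lag-1 autocorrelation: {r1:.3f}")
print(f"average windowed variance (true value 1.0): {mean_window_var:.3f}")
```

In this setup the windowed variance comes out well below the true value, because each short window sees only part of the series' slow wandering.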

2.2 Seasonality

Most production metrics have daily and weekly cycles. A single global mean and σ average over Tuesday-10:00 traffic and Sunday-03:00 traffic as if they came from the same distribution. The result is a threshold that is simultaneously too tight for your peak (false positives every Tuesday) and too loose for your trough (a real Sunday-night incident sails under it). Any calibration that ignores the cycle is wrong in both directions at once.
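The two-sided failure is easy to reproduce on synthetic data. A minimal sketch, assuming an hourly metric with a flat baseline and a mid-morning peak; the load shape, noise level, and variable names are hypothetical, and only a daily cycle is modeled:

```python
import math
import random
import statistics

rng = random.Random(1)

def healthy_value(hour):
    """Hypothetical healthy load: flat baseline plus a mid-morning peak and noise."""
    peak = 100.0 * math.exp(-0.5 * ((hour - 10) / 1.5) ** 2)
    return 50.0 + peak + rng.gauss(0.0, 5.0)

# Four weeks of hourly samples, each tagged with its hour of day.
samples = [(t % 24, healthy_value(t % 24)) for t in range(24 * 28)]
values = [v for _, v in samples]

# One global threshold computed over the whole cycle.
global_threshold = statistics.fmean(values) + 2 * statistics.pstdev(values)

peak = [v for h, v in samples if h == 10]
trough = [v for h, v in samples if h == 3]

# Share of perfectly healthy peak-hour samples that exceed the global threshold.
peak_fp_rate = sum(v > global_threshold for v in peak) / len(peak)
# How far a 03:00 value can rise above its own normal level before the rule notices.
trough_headroom = global_threshold - statistics.fmean(trough)

print(f"global threshold: {global_threshold:.1f}")
print(f"healthy 10:00 samples above it: {peak_fp_rate:.0%}")
print(f"undetected headroom at 03:00: {trough_headroom:.1f}")
```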

2.3 A sustained crossing is an event, not a point

Prometheus rules don't page on a single sample — the for: clause requires the condition to hold continuously for a duration. So the relevant probability is not "P(one point > threshold)" but "P(the series stays above the threshold for ≥ for: minutes)" — the probability of an excursion of a given length. That probability depends intimately on the autocorrelation structure: a strongly autocorrelated series produces long excursions far more often than an independent one with identical marginal distribution. Pointwise math, however carefully done, cannot answer the question the alert rule is actually asking.
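To illustrate, the sketch below counts sustained excursions in a synthetic autocorrelated series and in a shuffled copy of it. Shuffling keeps the marginal distribution exactly and destroys the dependence. Treating for: as a minimum number of consecutive samples is a simplification of Prometheus's evaluation semantics, and the AR(1) stand-in and all names are our assumptions.

```python
import math
import random

def simulate_ar1(n, phi, rng):
    """AR(1) series with unit marginal variance."""
    scale = math.sqrt(1.0 - phi * phi)
    x = rng.gauss(0.0, 1.0)
    out = []
    for _ in range(n):
        out.append(x)
        x = phi * x + scale * rng.gauss(0.0, 1.0)
    return out

def count_firings(series, threshold, for_samples):
    """Count maximal runs above `threshold` that last at least `for_samples` samples."""
    firings = 0
    run = 0
    for value in series:
        if value > threshold:
            run += 1
            if run == for_samples:  # the run has just become long enough to fire
                firings += 1
        else:
            run = 0
    return firings

rng = random.Random(0)
ar = simulate_ar1(200_000, 0.95, rng)
shuffled = ar[:]
rng.shuffle(shuffled)  # identical values, dependence removed

ar_firings = count_firings(ar, 2.0, 5)
shuffled_firings = count_firings(shuffled, 2.0, 5)
print(f"autocorrelated series: {ar_firings} sustained excursions")
print(f"shuffled copy:         {shuffled_firings} sustained excursions")
```

Both series have the same fraction of points above the threshold; only the ordering differs, and the ordering is what the for: clause tests.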

This is the one-line summary of our pitch, and it's literally true: your mean+2σ doesn't know your generating process. Our block bootstrap does — because it never assumes a process at all. It resamples yours.

3. The moving-block bootstrap, explained honestly

The bootstrap is a standard technique for estimating the sampling distribution of a statistic when you can't (or won't) assume a parametric model: resample your own data many times, recompute the statistic each time, read the distribution off the resamples. The naive version resamples individual points — which destroys exactly the time dependence we just argued is essential.

The moving-block bootstrap (Künsch 1989; Liu & Singh 1992) fixes this by resampling contiguous blocks of the series instead of points. Within each block, the autocorrelation, the local trends, and the excursion shapes are preserved verbatim, because the block is a literal slice of your history. The procedure, concretely:

  1. Take the cleaned healthy-regime history of the metric (see §4 for "cleaned").
  2. Choose a block length ℓ. We require ℓ ≥ 4× the rule's for: duration, and longer when the empirical autocorrelation decays slowly. Rationale: the statistic we care about is the rate of sustained excursions of length for:; blocks must be long enough that such excursions live inside blocks rather than being chopped at the seams. Otherwise the resampling artificially fragments long excursions and underestimates the FP rate, which is the dangerous direction.
  3. Build thousands of synthetic series by concatenating randomly drawn (overlapping) blocks until each reaches the original length.
  4. Replay the alert rule — same PromQL aggregation, same threshold, same for: — against every synthetic series, and count the firings.
  5. The distribution of firing counts across resamples, scaled to a year, is the estimate of FP/year under H0, with percentile confidence intervals read directly from the resample distribution.
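The five steps above can be sketched in a few lines. This is an illustrative outline under stated assumptions, not the PagerProof engine: the history is a synthetic AR(1) stand-in, the rule is a bare threshold with for: expressed as a sample count (no PromQL aggregation), seasonal stratification is omitted, and all function names are hypothetical.

```python
import math
import random

def count_firings(series, threshold, for_samples):
    """Count maximal runs above `threshold` that last at least `for_samples` samples."""
    firings, run = 0, 0
    for value in series:
        if value > threshold:
            run += 1
            if run == for_samples:
                firings += 1
        else:
            run = 0
    return firings

def moving_block_resample(series, block_len, rng):
    """Concatenate randomly drawn, possibly overlapping blocks up to the original length."""
    n = len(series)
    out = []
    while len(out) < n:
        start = rng.randrange(n - block_len + 1)
        out.extend(series[start:start + block_len])
    return out[:n]

def bootstrap_fp_per_year(history, threshold, for_samples, block_len,
                          samples_per_year, n_resamples=1000, seed=0):
    """Return (median, 2.5th percentile, 97.5th percentile) of FP/year across resamples."""
    if block_len < 4 * for_samples:
        raise ValueError("block length must be at least 4x the for: duration")
    if block_len > len(history):
        raise ValueError("block length exceeds the history length")
    rng = random.Random(seed)
    scale = samples_per_year / len(history)  # scale firing counts to one year
    rates = sorted(
        scale * count_firings(moving_block_resample(history, block_len, rng),
                              threshold, for_samples)
        for _ in range(n_resamples)
    )
    pick = lambda q: rates[round(q * (n_resamples - 1))]
    return pick(0.5), pick(0.025), pick(0.975)

# Synthetic stand-in for a cleaned healthy-regime history (assumption).
rng = random.Random(42)
x, history = 0.0, []
for _ in range(5000):
    x = 0.9 * x + math.sqrt(1 - 0.9 ** 2) * rng.gauss(0.0, 1.0)
    history.append(x)

median, lo, hi = bootstrap_fp_per_year(
    history, threshold=2.0, for_samples=5, block_len=60,
    samples_per_year=365 * 24 * 60, n_resamples=200,
)
print(f"FP/year: median {median:.0f}, 95% interval [{lo:.0f}, {hi:.0f}]")
```

The interval is read directly off the sorted resample rates, which is the percentile method named in step 5.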

What this buys you: an FP/year estimate that respects your series' actual dependence structure, with no Gaussian assumption, no stationarity-within-the-block assumption beyond what your own data exhibits, and an honest interval instead of a point estimate.

What it does not buy you, stated plainly: blocks are stitched at random seams, so dependence at lags longer than the block length is not preserved — which is why we size blocks against both the for: duration and the empirical autocorrelation decay, and why slow seasonal structure is handled separately (seasonal stratification of blocks: blocks are resampled within comparable time-of-day/day-of-week strata, so a Sunday-night block never stands in for a Tuesday peak). And like every resampling method, it can only show you behaviors your history contains: a regime your system has never visited is invisible to it. That last point is a fundamental limit of any data-driven method, ours included — see Limitations.
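One simple way to realize seasonal stratification is to let a block fill a position in the synthetic series only if it starts at the same phase of the cycle. The sketch below uses that strict phase-matching rule as an assumption; "comparable strata" could be defined more loosely, and the function name is hypothetical.

```python
import random

def stratified_block_resample(series, block_len, period, rng):
    """Moving-block resample in which every block keeps its original phase in the cycle."""
    n = len(series)
    if n < period + block_len:
        raise ValueError("history too short for phase-matched blocks")
    out = []
    while len(out) < n:
        phase = len(out) % period
        # Candidate starts: same phase in the cycle, with room for a full block.
        starts = range(phase, n - block_len + 1, period)
        start = rng.choice(starts)
        out.extend(series[start:start + block_len])
    return out[:n]

# Hourly samples over six weeks; each value is its own index so phases are easy to check.
period = 24 * 7
series = list(range(period * 6))
resampled = stratified_block_resample(series, block_len=24, period=period,
                                      rng=random.Random(0))

# Every resampled point sits at the same hour-of-week it came from.
aligned = all(v % period == i % period for i, v in enumerate(resampled))
print("phase preserved:", aligned)
```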

4. Excluding anomalous windows from the baseline

This is the step most home-grown calibrations skip, and it's the expensive mistake. The FP/year figure is defined under H0 — the healthy regime. If your 90-day history contains three incidents and a deploy gone wrong, and you bootstrap over all of it, you are calibrating against a baseline contaminated with the very anomalies the rule exists to catch.

Contamination biases the calibration in the worst possible way: incident windows inflate the apparent spread of "normal" behavior, the calibrated threshold moves up to accommodate them, and the rule becomes quietly insensitive to exactly the events it was written for. The threshold looks statistically justified and is operationally blind. You traded false positives for false negatives without anyone deciding to.
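It takes very little contamination to produce this effect. A minimal sketch, assuming a synthetic healthy series with one injected incident window, and comparing a mean-and-σ threshold with one built on the median and the median absolute deviation (MAD); the k = 2 multiplier and the data are illustrative only:

```python
import random
import statistics

def classic_threshold(values, k=2.0):
    """Mean plus k standard deviations."""
    return statistics.fmean(values) + k * statistics.pstdev(values)

def robust_threshold(values, k=2.0):
    """Median plus k robust standard deviations (MAD scaled by 1.4826 for normal data)."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return med + k * 1.4826 * mad

rng = random.Random(7)
clean = [rng.gauss(100.0, 5.0) for _ in range(2000)]

# Same history with a 60-sample incident window left in (3% of the data).
contaminated = clean[:]
for i in range(1000, 1060):
    contaminated[i] = 300.0

for name, fn in (("mean + 2 sigma", classic_threshold), ("median + 2 MAD-sigma", robust_threshold)):
    print(f"{name}: clean {fn(clean):.1f}, contaminated {fn(contaminated):.1f}")
```

The mean-based threshold jumps once the incident is included, while the robust one barely moves, which is the behavior outlier screening needs.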

So before any resampling, we identify and exclude anomalous windows from the baseline: known incidents from your records where available, plus windows flagged by robust outlier screening (methods based on medians and robust scale estimates, which a handful of extreme windows cannot drag the way they drag a mean and σ). Two honest notes on this step:

5. Calibration to a false-positive budget

With the machinery above, calibration inverts the question. Instead of "what FP rate does threshold X give?" we ask: "you've decided this rule deserves at most N false pages per year — what's the lowest threshold that fits in that budget?" The budget is an engineering decision (a page-severity rule might deserve 2/year; a ticket-severity one might tolerate 20); the threshold becomes a derived quantity instead of a guessed one.
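The inversion can be sketched as a scan over candidate thresholds. In the example below the FP/year estimator is a plain replay of a synthetic history, used only as a stand-in for the bootstrap estimate; the function names and the candidate grid are hypothetical.

```python
import math
import random

def count_firings(series, threshold, for_samples):
    """Count maximal runs above `threshold` that last at least `for_samples` samples."""
    firings, run = 0, 0
    for value in series:
        if value > threshold:
            run += 1
            if run == for_samples:
                firings += 1
        else:
            run = 0
    return firings

def calibrate(candidates, fp_per_year, budget):
    """Lowest candidate threshold whose estimated FP/year fits the budget, else None."""
    for threshold in sorted(candidates):
        if fp_per_year(threshold) <= budget:
            return threshold
    return None

# Synthetic stand-in for a healthy-regime history at 1-minute resolution (assumption).
rng = random.Random(3)
x, history = 0.0, []
for _ in range(20_000):
    x = 0.9 * x + math.sqrt(1 - 0.9 ** 2) * rng.gauss(0.0, 1.0)
    history.append(x)

samples_per_year = 365 * 24 * 60

def replay_fp_per_year(threshold, for_samples=5):
    """Stand-in estimator: replay the history once and scale the count to a year."""
    return count_firings(history, threshold, for_samples) * samples_per_year / len(history)

candidates = [round(1.0 + 0.1 * i, 1) for i in range(41)]  # 1.0 to 5.0
chosen = calibrate(candidates, replay_fp_per_year, budget=2.0)
print(f"lowest threshold within a 2/year budget: {chosen}")
```

The scan goes from low to high rather than using bisection because the count of sustained excursions is not guaranteed to fall monotonically as the threshold rises: a higher threshold can split one long excursion into two that each still satisfy for:.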

For each rule, The Proof reports: the current threshold and its measured FP/year (most teams see their noisiest rules at 50–100+/year — that's the "verdict: noisy" line in the report); the calibrated threshold meeting the agreed budget, with its confidence interval; and a sensitivity analysis — we replay the historical incident windows (the ones excluded from the baseline in §4) against the new threshold and report which would still have fired, and with how much margin. Where we adjust for: durations as well as thresholds, both the old and new excursion statistics are shown.

Because an audit calibrates many rules at once, we account for the multiplicity: fifty rules each individually calibrated to a small FP rate still add up across the rule set, and the report presents both the per-rule and the aggregate expected page load, so the on-call budget is honest at the level your humans actually experience it — total pages, not per-rule fictions.
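As a worked illustration of how per-rule rates add up, write N_i for the yearly false-page count of rule i and λ_i for its expectation (notation introduced here, not taken from the report). Linearity of expectation gives the aggregate load regardless of how the rules are correlated:

```latex
\mathbb{E}\Big[\sum_{i=1}^{R} N_i\Big] = \sum_{i=1}^{R} \lambda_i ,
\qquad
R = 50,\ \lambda_i = 2/\text{year}
\;\Longrightarrow\;
\sum_{i=1}^{R} \lambda_i = 100/\text{year} \approx 1.9/\text{week}.
```

So fifty rules that each meet a 2/year budget still produce roughly two false pages a week for whoever is on call.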

Rules where the history cannot support a defensible number — too little data, a metric that changed identity mid-history, a regime shift we can't bridge — are marked "do not touch: insufficient data" with the reason. A calibration we can't defend with numbers is a guess wearing a lab coat, and we don't ship those.

6. Limitations — read this before buying

This section is prominent on purpose. Every method has limits; vendors who don't state theirs are asking you to discover them in production.

  • Recall is not measurable without labeled incidents. FP/year under H0 is a number we can demonstrate. The complementary quantity — what fraction of real incidents the rule catches — requires a labeled incident history, which most teams don't have in statistically useful quantity. The sensitivity replay (§5) checks the new thresholds against the incidents you do have on record, but a handful of incidents is anecdote, not measurement. We will never quote you a "recall rate"; anyone who does, with the data you have, is making it up.
  • Histories shorter than ~14 days give high uncertainty. The bootstrap reads its evidence from your history. With less than about two weeks of data the confidence intervals on FP/year get wide enough to be honest but not very useful, and weekly seasonality has been observed exactly once or twice. We'll still run the analysis, but those rules come back flagged with their wide intervals visible — and often land in "do not touch" territory.
  • Multi-series rules are analyzed as their max-envelope. A rule templated over hundreds of label combinations (per-pod, per-instance) is analyzed against the maximum envelope across series — the worst case that would trigger the alert. This is conservative and tractable, but it means per-series nuance (one specifically pathological pod) is summarized, not individually modeled. For rules where per-series behavior diverges wildly, we say so in the report.
  • Calibrated thresholds require human review. The numbers tell you what normal behavior looks like statistically. They do not know that the CFO demo is on Thursday, that this service is being deprecated, or that a "statistically fine" latency level violates a contractual SLO. Every threshold in The Proof is a recommendation with its evidence attached — your engineers approve each one. We consider this a feature: an alerting change nobody reviewed is how this mess started.
  • The estimate is conditional on the validated regime. Stated in §1, worth repeating: structural change in your traffic invalidates the condition under which FP/year was computed. For fast-moving systems that's the case for continuous recalibration (PagerProof Server); for stable systems, an audit's thresholds genuinely last — and we'll tell you which kind you are rather than sell you the subscription regardless.
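For the multi-series limitation above, the max-envelope reduction can be sketched as a pointwise maximum over time-aligned series. The pod names and values are made up, and the function name is hypothetical.

```python
def max_envelope(series_by_label):
    """Pointwise maximum across equally long, time-aligned series."""
    lengths = {len(s) for s in series_by_label.values()}
    if len(lengths) != 1:
        raise ValueError("series must be time-aligned and equally long")
    return [max(column) for column in zip(*series_by_label.values())]

# Hypothetical per-pod utilization samples.
pods = {
    "pod-a": [0.2, 0.3, 0.9, 0.4],
    "pod-b": [0.1, 0.8, 0.2, 0.3],
    "pod-c": [0.3, 0.3, 0.3, 0.95],
}
envelope = max_envelope(pods)
print(envelope)
```

The envelope can stay above a threshold even when no single series does, because different series can take turns at the top. That is the sense in which the reduction is conservative.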

7. References & further reading

Want this applied to your rules? The audit is fixed-price (€1,500 for up to 50 rules), comes with a money-back guarantee if we can't show a ≥30% projected FP reduction while maintaining detection, and the price is credited toward a Server subscription.