How we cut alert fatigue by 60% with AI triage
Aisha Patel
14 April, 2026
Alert fatigue doesn't just tire out engineers — it erodes trust in the entire monitoring system. We built an AI triage layer that cut weekly pages by 60% while keeping our P1 detection rate at 100%.
Alert fatigue is one of the most insidious problems in on-call engineering. When every alert feels like noise, engineers stop trusting the signal, and that's when real incidents get missed. Here's how we cut pages by 60% without losing coverage.
It's not just about tired engineers. Alert fatigue has measurable consequences:
The real cost isn't the alerts themselves. It's the trust damage they cause over time.
Before the changes, our team was receiving an average of 340 alerts per week. Of those:
An engineer responding to every alert was making 340 decisions a week — most of them pointless.
We built a lightweight triage layer that sits between our alerting system and the on-call engineer. When an alert fires, the AI:
Alerts below a confidence threshold get logged but not paged. Engineers can review them async.
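The page-or-log routing step can be sketched as below. This is a minimal illustration, not our production code: `Alert`, `triage`, `handle`, `review_queue`, and the `0.5` threshold are all hypothetical, and the scoring model is passed in as a plain function so the routing logic stands on its own.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, List

# Hypothetical cut-off: the post doesn't state the real threshold.
PAGE_THRESHOLD = 0.5


@dataclass
class Alert:
    name: str
    service: str
    hour: int  # hour of day the alert fired, 0-23


@dataclass
class Decision:
    alert: Alert
    score: float  # model's confidence that a human needs to act
    action: str   # "page" or "log"


def triage(alert: Alert, score_fn: Callable[[Alert], float],
           threshold: float = PAGE_THRESHOLD) -> Decision:
    """Page when confidence meets the threshold; otherwise log only."""
    score = score_fn(alert)
    action = "page" if score >= threshold else "log"
    return Decision(alert, score, action)


# Suppressed alerts land here so engineers can review them asynchronously.
review_queue: List[Decision] = []


def handle(alert: Alert, score_fn: Callable[[Alert], float]) -> Decision:
    decision = triage(alert, score_fn)
    if decision.action == "log":
        review_queue.append(decision)
    # A real system would call the paging provider when action == "page".
    return decision
```

Keeping the threshold as a single parameter makes it easy to tune separately from the model.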
The triage model started with three inputs: alert name, service, and time of day. That alone cut pages by 35%. Adding historical resolution data took the reduction to 52%, and the final 8 percentage points came from dependency health context.
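To make the three feature stages concrete, here is a sketch of a hand-weighted logistic score that layers them in the same order. The post doesn't describe the actual model, so everything here is an assumption: the off-hours encoding of time of day, using the log-odds of the historical action rate, and the weight names are illustrative. In particular, whether a degraded upstream dependency should raise or lower the score is left to the sign of its weight. The test below uses a negative weight, treating the alert as a likely downstream symptom.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class TriageContext:
    # Stage 1: the three starting inputs named in the post.
    alert_name: str
    service: str
    hour: int  # 0-23
    # Stage 2 (hypothetical encoding): share of past firings that needed a human.
    historical_action_rate: Optional[float] = None
    # Stage 3 (hypothetical encoding): is an upstream dependency unhealthy now?
    dependency_degraded: Optional[bool] = None


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def score(ctx: TriageContext, weights: Dict[object, float]) -> float:
    """Confidence in [0, 1] that the alert needs a human (illustrative only)."""
    # Per-(alert, service) bias in log-odds; unseen pairs start neutral at 0.
    z = weights.get(("bias", ctx.alert_name, ctx.service), 0.0)
    # Time of day, crudely encoded as an off-hours flag.
    if ctx.hour < 8 or ctx.hour >= 20:
        z += weights.get("off_hours", 0.0)
    # Historical resolution data, added as log-odds of past actionability.
    if ctx.historical_action_rate is not None:
        p = min(max(ctx.historical_action_rate, 0.01), 0.99)  # avoid log(0)
        z += weights.get("history", 1.0) * math.log(p / (1.0 - p))
    # Dependency health context; the sign of this weight is a design choice.
    if ctx.dependency_degraded:
        z += weights.get("dependency_degraded", 0.0)
    return sigmoid(z)
```

With no weights and no history the score sits at 0.5. With a 90% historical action rate and the default history weight of 1.0, the score comes out at 0.9, so each stage can be added without retraining the earlier ones.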
Critically, we kept humans in the loop on tuning. Any alert the AI suppressed that later turned into a real incident was flagged for review, and those reviews fed back into the model's priors. False negatives dropped to near zero within 6 weeks.
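One way to picture "updating priors" from that review loop is a Beta prior per alert type on the probability that the alert is actionable. This is a sketch under that assumption, not a description of our model. `BetaPrior`, `review_suppressed`, and the extra weight of 5 given to a missed incident are hypothetical choices, meant to show how a single false negative can move the prior sharply.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

AlertKey = Tuple[str, str]  # (alert name, service)


@dataclass
class BetaPrior:
    actionable: float = 1.0  # alpha: pseudo-count of firings that needed a human
    noise: float = 1.0       # beta: pseudo-count of firings that didn't

    @property
    def mean(self) -> float:
        """Expected probability that this alert type needs a human."""
        return self.actionable / (self.actionable + self.noise)


priors: Dict[AlertKey, BetaPrior] = defaultdict(BetaPrior)
review_log: List[AlertKey] = []  # false negatives queued for human review


def review_suppressed(key: AlertKey, became_incident: bool,
                      false_negative_weight: float = 5.0) -> None:
    """Feed back the eventual outcome of an alert the triage layer suppressed."""
    prior = priors[key]
    if became_incident:
        review_log.append(key)  # a person looks at every miss
        prior.actionable += false_negative_weight  # hypothetical up-weighting
    else:
        prior.noise += 1.0
```

For example, three suppressions that stayed quiet move a fresh prior from 0.5 to 0.2. One later miss, weighted at 5, pushes it to 0.6, which is above a 0.5 paging threshold again.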
After 90 days:
The 60% headline number is real, but the more important number is that last one. Reducing noise only matters if you don't lose signal.
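The trade-off comes down to two ratios: the fractional drop in pages, and the share of P1 incidents that still paged someone, which is the detection rate from the opening. A minimal sketch of both follows. The `Incident` record shape and the function names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Incident:
    severity: str  # e.g. "P1", "P2"
    paged: bool    # did the triage layer page someone for it?


def page_reduction(pages_before: int, pages_after: int) -> float:
    """Fractional drop in pages, e.g. 0.6 for a 60% cut."""
    if pages_before <= 0:
        raise ValueError("need a non-zero baseline")
    return 1.0 - pages_after / pages_before


def detection_rate(incidents: List[Incident], severity: str = "P1") -> Optional[float]:
    """Share of incidents at `severity` that still produced a page."""
    relevant = [i for i in incidents if i.severity == severity]
    if not relevant:
        return None  # no incidents at this severity in the window
    return sum(i.paged for i in relevant) / len(relevant)
```

Returning `None` when there were no P1s in the window avoids reporting a misleading 100% on an empty sample.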
Start with a longer shadow period. We ran the AI in "observe only" mode for 4 weeks before letting it suppress pages. In retrospect, we should have run it for 8: the model needed more data on low-frequency alert types before we could trust its judgment on them.
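The lesson about low-frequency alert types suggests gating suppression per alert type rather than switching it on globally. The sketch below assumes shadow mode records two things for each firing: what the model would have done, and whether the alert turned out to be a real incident. Everything still pages during this phase. `ShadowRecord`, `ready_to_suppress`, and `min_samples=20` are hypothetical placeholders, not values from our rollout.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass
class ShadowRecord:
    alert_key: str
    would_suppress: bool  # the model's decision, logged but not acted on
    was_incident: bool    # the real outcome (the alert still paged in shadow mode)


def ready_to_suppress(records: Iterable[ShadowRecord],
                      min_samples: int = 20, max_missed: int = 0) -> Set[str]:
    """Alert types with enough shadow data and no would-be missed incidents."""
    seen: Counter = Counter()
    missed: Counter = Counter()
    for r in records:
        seen[r.alert_key] += 1
        if r.would_suppress and r.was_incident:
            missed[r.alert_key] += 1  # a false negative the model would have caused
    return {key for key, n in seen.items()
            if n >= min_samples and missed[key] <= max_missed}
```

A rare alert type stays in page-everything mode until it has accumulated `min_samples` observations. This is the kind of gap a longer shadow period would close.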