How we cut alert fatigue by 60% with AI triage

Aisha Patel

14 April, 2026

Alert fatigue doesn't just tire out engineers — it erodes trust in the entire monitoring system. We built an AI triage layer that cut weekly pages by 60% while keeping our P1 detection rate at 100%.


Alert fatigue is one of the most insidious problems in on-call engineering. When every alert feels like noise, engineers stop trusting the signal, and that's when real incidents get missed. Here's how we reduced page volume by 60% without losing coverage.

What alert fatigue actually costs

It's not just about tired engineers. Alert fatigue has measurable consequences:

  • Slower response times — engineers learn to wait before acting, expecting alerts to auto-resolve
  • Missed critical incidents — real issues get buried in noise
  • On-call burnout — teams start dreading their rotation
  • Alert tuning debt — nobody has time to fix the alerts that shouldn't fire

The real cost isn't the alerts themselves. It's the trust damage they cause over time.

Our alert landscape before

Before the changes, our team was receiving an average of 340 alerts per week. Of those:

  • 58% auto-resolved within 5 minutes (pure noise)
  • 22% were duplicates of an active incident
  • 12% required action but not immediate response
  • 8% were genuinely urgent

An engineer responding to every alert was making 340 decisions a week, and about 80% of them (the auto-resolved and duplicate alerts) needed no action at all.
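To put the percentages in absolute terms, here is a small sketch that converts the breakdown above into average alerts per week. The figures come from the list above; the names `WEEKLY_ALERTS`, `BREAKDOWN_PCT`, and `weekly_counts` are illustrative only.

```python
WEEKLY_ALERTS = 340  # weekly average quoted above

# Share of weekly alerts in each category, in percent (from the list above).
BREAKDOWN_PCT = {
    "auto-resolved within 5 minutes": 58,
    "duplicate of an active incident": 22,
    "action needed, not immediate": 12,
    "genuinely urgent": 8,
}


def weekly_counts(total, breakdown_pct):
    """Convert percentage shares into average alerts per week."""
    return {label: total * pct / 100 for label, pct in breakdown_pct.items()}


counts = weekly_counts(WEEKLY_ALERTS, BREAKDOWN_PCT)
for label, count in counts.items():
    print(f"{label}: {count:.1f} per week")
```

On these numbers, roughly 27 of the 340 weekly alerts were genuinely urgent.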

The AI triage layer

We built a lightweight triage layer that sits between our alerting system and our on-call engineer. When an alert fires, the AI:

  1. Checks if a related incident is already open
  2. Looks at historical resolution rate for this alert type in the past 30 days
  3. Checks current service health across dependencies
  4. Assigns a confidence score for "requires human attention"

Alerts that score below a confidence threshold are logged but not paged. Engineers can review them asynchronously.
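The four steps and the threshold can be sketched roughly as follows. This is a minimal illustration, not the production model: the `Alert` type, the scoring rule, the `0.3` dependency boost, and the `PAGE_THRESHOLD` value are all assumptions made up for the example.

```python
from dataclasses import dataclass

PAGE_THRESHOLD = 0.5  # hypothetical cut-off; the real value is not given in the post


@dataclass
class Alert:
    name: str
    service: str


def triage_score(alert, open_incident_services, auto_resolve_rate, unhealthy_dependencies):
    """Return a 0..1 score for "requires human attention" (illustrative weights)."""
    # Step 1: a related incident is already open, so treat this as a duplicate.
    if alert.service in open_incident_services:
        return 0.0
    # Step 2: alert types that usually resolved on their own in the past 30 days
    # start with a low score. Unknown alert types default to the maximum score.
    score = 1.0 - auto_resolve_rate.get(alert.name, 0.0)
    # Step 3: unhealthy dependencies make a real problem more likely.
    if unhealthy_dependencies:
        score = min(1.0, score + 0.3)
    # Step 4: the result is the confidence that a human is needed.
    return score


def route(alert, open_incident_services, auto_resolve_rate, unhealthy_dependencies):
    """Page the on-call engineer or log the alert for async review."""
    score = triage_score(alert, open_incident_services, auto_resolve_rate, unhealthy_dependencies)
    return "page" if score >= PAGE_THRESHOLD else "log"


history = {"disk_latency_high": 0.7}  # made-up 30-day auto-resolution rate
alert = Alert(name="disk_latency_high", service="db")
print(route(alert, set(), history, []))           # healthy dependencies
print(route(alert, set(), history, ["storage"]))  # one unhealthy dependency
```

Defaulting unknown alert types to a score of 1.0 is a deliberately conservative choice in this sketch: an alert with no history pages a human rather than being suppressed.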

[Figure: engineering monitoring dashboard]

What we tuned

The triage model started with three inputs: alert name, service, and time of day. That alone got us a 35% reduction in pages. Adding historical resolution data got us to 52%. The final 8 percentage points came from dependency health context, for 60% in total.
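The staged gains are easier to compare as a small table. The sketch below assumes each percentage is measured against the original 340 pages per week; the post confirms this only for the final figure, and `stage_table` is an illustrative helper name.

```python
BASELINE_PAGES = 340  # pages per week before triage

# Cumulative page reduction after each input was added, in percent.
STAGES = [
    ("alert name, service, time of day", 35),
    ("+ historical resolution data", 52),
    ("+ dependency health context", 60),
]


def stage_table(baseline, stages):
    """Return (label, gain in percentage points, pages per week) for each stage."""
    rows = []
    previous_pct = 0
    for label, pct in stages:
        pages = baseline * (100 - pct) / 100
        rows.append((label, pct - previous_pct, pages))
        previous_pct = pct
    return rows


for label, gain, pages in stage_table(BASELINE_PAGES, STAGES):
    print(f"{label}: +{gain} points, {pages:.1f} pages/week")
```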

Critically, we kept humans in the loop on tuning. Any alert the AI suppressed that later became a real incident got flagged for review. The model updated its priors. False negatives dropped to near zero within 6 weeks.
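A minimal sketch of that review step, assuming suppressed alerts and real incidents can be joined on a shared alert ID. The function names and the sample IDs are hypothetical, and how the model is retrained on the flagged alerts is left out.

```python
def flag_false_negatives(suppressed_ids, incident_ids):
    """Return suppressed alert IDs that later turned into real incidents."""
    return sorted(set(suppressed_ids) & set(incident_ids))


def false_negative_rate(suppressed_ids, incident_ids):
    """Fraction of suppressed alerts that should have paged."""
    unique_suppressed = set(suppressed_ids)
    if not unique_suppressed:
        return 0.0
    return len(flag_false_negatives(suppressed_ids, incident_ids)) / len(unique_suppressed)


# Made-up IDs for illustration only.
suppressed = ["a1", "a2", "a3", "a4"]
incidents = ["a3", "b9"]

print(flag_false_negatives(suppressed, incidents))  # alerts to send for human review
print(false_negative_rate(suppressed, incidents))   # the number to track week over week
```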

The outcome

After 90 days:

  • Pages reduced from 340/week to 136/week
  • MTTR improved by 18% (fewer interruptions = better focus when it matters)
  • On-call NPS went from -12 to +34
  • Zero missed P1 incidents during the trial period

The 60% headline number is real, but the more important number is that last one. Reducing noise only matters if you don't lose signal.

What we'd do differently

Start with a longer shadow period. We ran the AI in "observe only" mode for 4 weeks before letting it suppress pages. In retrospect we should have done 8 weeks — the model needed more data on low-frequency alert types before we trusted its judgment on them.
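One way to picture "observe only" mode: the triage decision is recorded but never acted on, so every alert still pages while the would-be suppressions pile up for comparison against real incidents. This is a sketch under assumed names (`handle_alert`, `shadow_mode`, `would_suppress_log`), not the actual implementation.

```python
def handle_alert(alert_id, score, threshold, shadow_mode, would_suppress_log):
    """Return True if the alert should page the on-call engineer."""
    wants_suppress = score < threshold
    if wants_suppress:
        # Record the decision either way so it can be audited later.
        would_suppress_log.append(alert_id)
    if shadow_mode:
        # Observe only: the model's opinion is logged, but everything still pages.
        return True
    return not wants_suppress


log = []
print(handle_alert("a1", 0.2, 0.5, shadow_mode=True, would_suppress_log=log))   # still pages
print(handle_alert("a2", 0.2, 0.5, shadow_mode=False, would_suppress_log=log))  # suppressed
print(log)
```

Keeping the same log in both modes means the switch from shadow to live changes only who gets paged, not what gets recorded.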