Turning your incident data into a knowledge graph

Spencer Cheng

22 April, 2026


Every incident your team resolves is a lesson waiting to be structured. A knowledge graph transforms scattered post-mortems into a living system your AI can reason over — turning institutional memory into a real-time advantage.

Every time an incident happens, your team generates a wealth of information — timelines, root causes, affected services, and the people involved. Most of this knowledge lives in post-mortems that nobody reads twice. What if you could turn all of that into a structured knowledge graph that your AI systems could actually reason over?

The problem with flat post-mortems

Post-mortems are written in free-form prose. They're useful in the moment but nearly impossible to query at scale. When your 200th incident hits, you can't easily answer questions like:

Which services caused the most cascading failures?
Which engineers have seen this class of bug before?
What runbooks were effective for this symptom pattern?

The data is there — it's just locked in unstructured text.

What a knowledge graph gives you

A knowledge graph models entities (services, engineers, incidents, runbooks, root causes) as nodes and their relationships as edges. Once your incidents are in graph form, you can ask questions that flat documents can't answer:

"Show me all incidents that touched the payments service in the last 6 months"
"Which incidents had a resolution time under 30 minutes and why?"
"Find me engineers who've resolved this exact error before"

This is the foundation for AI-powered incident response — an LLM with access to your knowledge graph can reason over your entire incident history, not just a single post-mortem.
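As a rough illustration, the first two questions above can be answered with a few lines of Python once incidents are nodes with attributes and "affected" edges point at services. This is only a sketch: the dates, the extra incident IDs, the `opened` attribute, and the fixed "today" are invented for the example, and a real deployment would more likely run the same query against a graph database.

```python
from datetime import date, timedelta

# Incident nodes with attributes. All dates and every incident other than
# INC-1042 are invented for illustration.
INCIDENTS = {
    "INC-1042": {"opened": date(2026, 3, 10), "duration_minutes": 47},
    "INC-0990": {"opened": date(2026, 1, 5), "duration_minutes": 22},
    "INC-0412": {"opened": date(2025, 6, 1), "duration_minutes": 95},
}

# (incident, service) pairs standing in for AFFECTED edges.
AFFECTED = [
    ("INC-1042", "payments-api"),
    ("INC-0990", "payments-api"),
    ("INC-0412", "payments-api"),
    ("INC-0990", "search-api"),
]


def incidents_touching(service: str, since: date) -> list[str]:
    """Return IDs of incidents that affected `service` on or after `since`."""
    return sorted(
        incident
        for incident, affected_service in AFFECTED
        if affected_service == service and INCIDENTS[incident]["opened"] >= since
    )


# A fixed "today" keeps the example reproducible; use date.today() in practice.
today = date(2026, 4, 22)
six_months_ago = today - timedelta(days=182)

recent = incidents_touching("payments-api", six_months_ago)
fast = [i for i in recent if INCIDENTS[i]["duration_minutes"] < 30]

print("Touched payments-api in the last 6 months:", recent)
print("Of those, resolved in under 30 minutes:", fast)
```

The "and why" part of the second question is where the other edges come in: from each fast incident you would follow its runbook and root-cause edges.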


How to build it

Step 1: Extract entities from post-mortems

Use an LLM to parse each post-mortem and extract the same fixed set of fields every time, for example:

{
  "incident_id": "INC-1042",
  "services_affected": ["payments-api", "checkout-service"],
  "root_cause": "misconfigured rate limit after deploy",
  "resolved_by": ["alice@gleg.ai"],
  "duration_minutes": 47,
  "runbooks_used": ["RB-22", "RB-08"]
}
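LLM output is not guaranteed to match the shape you asked for, so it may be worth validating each record before it goes anywhere near the graph. The sketch below assumes the field names from the example above; `IncidentRecord`, `REQUIRED_FIELDS`, and `parse_incident` are hypothetical names, and the model call itself is left out.

```python
import json
from dataclasses import dataclass


@dataclass
class IncidentRecord:
    incident_id: str
    services_affected: list[str]
    root_cause: str
    resolved_by: list[str]
    duration_minutes: int
    runbooks_used: list[str]


# Expected top-level type for each field in the extracted JSON.
REQUIRED_FIELDS = {
    "incident_id": str,
    "services_affected": list,
    "root_cause": str,
    "resolved_by": list,
    "duration_minutes": int,
    "runbooks_used": list,
}


def parse_incident(raw: str) -> IncidentRecord:
    """Parse one LLM response; raise ValueError if a field is missing or mistyped."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        if not isinstance(data[name], expected):
            raise ValueError(f"field {name} should be of type {expected.__name__}")
    return IncidentRecord(**{name: data[name] for name in REQUIRED_FIELDS})


# Stand-in for the text returned by the model.
llm_output = """
{
  "incident_id": "INC-1042",
  "services_affected": ["payments-api", "checkout-service"],
  "root_cause": "misconfigured rate limit after deploy",
  "resolved_by": ["alice@gleg.ai"],
  "duration_minutes": 47,
  "runbooks_used": ["RB-22", "RB-08"]
}
"""

record = parse_incident(llm_output)
print(record.incident_id, record.services_affected, record.duration_minutes)
```

Records that fail validation can be queued for a retry or a manual look rather than silently dropped.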

Step 2: Model relationships

Map your entities to a graph schema:

Incident → AFFECTED → Service
Engineer → RESOLVED → Incident
Incident → USED → Runbook
Incident → CAUSED_BY → RootCause
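One way to read the schema above is as a list of (source, relation, target) triples produced from each extracted record. The following is a minimal sketch under that assumption; `Edge` and `incident_to_edges` are hypothetical names, and using the free-text root cause directly as a node is a simplification (in practice you would probably normalize it into categories first).

```python
from typing import NamedTuple


class Edge(NamedTuple):
    source: str
    relation: str
    target: str


def incident_to_edges(record: dict) -> list[Edge]:
    """Turn one extracted record into edges that follow the schema above."""
    incident = record["incident_id"]
    # Incident -> AFFECTED -> Service
    edges = [Edge(incident, "AFFECTED", s) for s in record["services_affected"]]
    # Engineer -> RESOLVED -> Incident (note the engineer is the source)
    edges += [Edge(e, "RESOLVED", incident) for e in record["resolved_by"]]
    # Incident -> USED -> Runbook
    edges += [Edge(incident, "USED", rb) for rb in record["runbooks_used"]]
    # Incident -> CAUSED_BY -> RootCause (assumption: raw text used as node id)
    edges.append(Edge(incident, "CAUSED_BY", record["root_cause"]))
    return edges


record = {
    "incident_id": "INC-1042",
    "services_affected": ["payments-api", "checkout-service"],
    "root_cause": "misconfigured rate limit after deploy",
    "resolved_by": ["alice@gleg.ai"],
    "duration_minutes": 47,
    "runbooks_used": ["RB-22", "RB-08"],
}

edges = incident_to_edges(record)
for edge in edges:
    print(f"{edge.source} -[{edge.relation}]-> {edge.target}")
```

Plain triples like these load into most graph stores with little translation, which keeps the extraction step independent of whichever database you pick later.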

Step 3: Wire it to your AI layer

Give your AI agent a set of graph query tools. When an incident fires, it can call them to surface similar past incidents, the engineers who resolved them, and the runbooks that worked.
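A graph query tool can be as small as one function the agent is allowed to call. The sketch below is one possible shape, not a description of any particular product: `find_similar_incidents` is a hypothetical name, "similar" is defined here simply as sharing affected services, the edges other than those for INC-1042 are invented, and the in-memory list stands in for a real graph store.

```python
from collections import Counter

# (source, relation, target) triples. Everything except INC-1042 is invented.
EDGES = [
    ("INC-1042", "AFFECTED", "payments-api"),
    ("INC-1042", "AFFECTED", "checkout-service"),
    ("alice@gleg.ai", "RESOLVED", "INC-1042"),
    ("INC-1042", "USED", "RB-22"),
    ("INC-1042", "USED", "RB-08"),
    ("INC-0990", "AFFECTED", "payments-api"),
    ("bob@example.com", "RESOLVED", "INC-0990"),
    ("INC-0990", "USED", "RB-22"),
    ("INC-0871", "AFFECTED", "search-api"),
    ("carol@example.com", "RESOLVED", "INC-0871"),
    ("INC-0871", "USED", "RB-03"),
]


def find_similar_incidents(services: list[str], edges: list[tuple], limit: int = 5) -> dict:
    """Rank past incidents by shared affected services, then collect the
    engineers who resolved them and the runbooks they used."""
    wanted = set(services)
    # Count, per incident, how many of the wanted services it affected.
    overlap = Counter(
        src for src, rel, dst in edges if rel == "AFFECTED" and dst in wanted
    )
    top = [incident for incident, _ in overlap.most_common(limit)]
    top_set = set(top)
    engineers = Counter(
        src for src, rel, dst in edges if rel == "RESOLVED" and dst in top_set
    )
    runbooks = Counter(
        dst for src, rel, dst in edges if rel == "USED" and src in top_set
    )
    return {
        "incidents": top,
        "engineers": [name for name, _ in engineers.most_common()],
        "runbooks": [rb for rb, _ in runbooks.most_common()],
    }


# Called by the agent when a new incident affects these two services.
result = find_similar_incidents(["payments-api", "checkout-service"], EDGES)
print(result)
```

Returning a small, structured dictionary rather than raw graph data keeps the tool's output easy for the model to quote back to the on-call engineer.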

What this looks like in practice

At Gleg, we've seen teams cut their mean time to resolution by over 40% once their AI has access to a well-structured incident knowledge graph. The AI doesn't replace your on-call engineer — it gives them a 10-year institutional memory in the first 30 seconds of an incident.

The graph also gets smarter over time. Every resolved incident adds new edges, new patterns, and new context. The longer you run it, the more valuable it becomes.

Getting started

You don't need to graph every incident at once. Start with your last 50 post-mortems, extract entities with an LLM, and build a small graph. Run a few queries. See what surfaces. Then scale from there.

The hardest part isn't the technology — it's committing to writing post-mortems consistently enough that the data is worth graphing.