The problem
During an incident, responders burn time hunting across dozens of runbooks and troubleshooting guides — and the ones that matter are often linked in ways plain keyword search can't see. An auth-throttling event shows up as payment timeouts; two guides are related because they query the same telemetry table, not because they share words. Flat similarity search misses exactly those cascade paths.
What I built
A small, hackable graph-RAG engine that turns a folder of runbooks into a queryable knowledge graph. Given an incident description, it returns the most relevant guides plus the graph-linked siblings flat search would miss — along with the exact telemetry queries to run and the documented mitigations, each attributed to its source runbook.
The reasoning layer is your own coding assistant; the engine does fast, explainable, local retrieval and graph traversal. It never calls an LLM and never touches the network.
How it works
- Entity-aware ingestion: parses runbooks and extracts the nouns that connect them: services, symptoms, mitigations, and telemetry queries (KQL table/cluster/function extraction).
- Typed knowledge graph: runbooks ↔ tables ↔ services ↔ symptoms, with shortest-path traversal (to reason about cascades) and "god-node" detection (recurring failure hotspots).
- Hybrid retrieval: BM25 lexical scoring fused with graph expansion, so siblings sharing a telemetry table get surfaced and explained.
- Assistant-ready: a CLI (`query`/`explain`/`path`/`godnodes`) and an optional MCP server expose the graph as tools.
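To make the ingestion step concrete, here is a minimal sketch of pulling table and cluster names out of a KQL query with regular expressions. It is an illustration under assumptions, not the engine's actual parser: the function name `extract_kql_entities`, the patterns, and the sample table names are all hypothetical, and the heuristics cover only leading tables, `join`/`union` operands, and `cluster(...).database(...).Table` references (function extraction is left out).

```python
import re

# Hypothetical heuristics; a real KQL grammar is far richer than this.
CLUSTER = re.compile(r"cluster\(\s*['\"]([^'\"]+)['\"]\s*\)")
QUALIFIED = re.compile(r"database\(\s*['\"][^'\"]+['\"]\s*\)\.([A-Za-z_]\w*)")
JOINED = re.compile(r"\b(?:join|union)\b[^(|\n]*\(\s*([A-Za-z_]\w*)")
FIRST_TOKEN = re.compile(r"\s*([A-Za-z_]\w*)")
NOT_TABLES = {"let", "cluster", "database", "union", "print"}


def extract_kql_entities(query):
    """Return the table and cluster names a single KQL query appears to touch."""
    candidates = set()
    first = FIRST_TOKEN.match(query)
    if first:
        candidates.add(first.group(1))           # query usually starts with its source table
    candidates.update(QUALIFIED.findall(query))  # cluster(...).database(...).Table
    candidates.update(JOINED.findall(query))     # join/union (OtherTable ...)
    tables = {name for name in candidates if name.lower() not in NOT_TABLES}
    clusters = set(CLUSTER.findall(query))
    return {"tables": sorted(tables), "clusters": sorted(clusters)}


# Sample queries with made-up table and cluster names.
first_result = extract_kql_entities("""AuthRequests
| where ResultCode == 429
| join kind=inner (PaymentEvents | where DurationMs > 5000) on CorrelationId""")

second_result = extract_kql_entities(
    "cluster('telemetry-prod').database('Auth').ThrottleLog | take 10"
)

print(first_result)   # {'tables': ['AuthRequests', 'PaymentEvents'], 'clusters': []}
print(second_result)  # {'tables': ['ThrottleLog'], 'clusters': ['telemetry-prod']}
```

Each extracted name becomes a node that other runbooks can link to, which is what lets two guides be related without sharing any prose vocabulary.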
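The typed graph can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the engine's code: the class `KnowledgeGraph`, its method names, and the runbook file names are hypothetical, shortest paths use plain breadth-first search over undirected edges, and a "god node" is assumed here to mean any node whose degree reaches a threshold.

```python
from collections import defaultdict, deque


class KnowledgeGraph:
    """Undirected graph whose nodes carry a type (runbook, table, service, symptom)."""

    def __init__(self):
        self.node_type = {}
        self.adj = defaultdict(set)

    def add_edge(self, a, a_type, b, b_type):
        self.node_type[a] = a_type
        self.node_type[b] = b_type
        self.adj[a].add(b)
        self.adj[b].add(a)

    def shortest_path(self, start, goal):
        """Breadth-first search; returns the node list or None if unreachable."""
        if start not in self.node_type or goal not in self.node_type:
            return None
        parent = {start: None}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                path = []
                while node is not None:
                    path.append(node)
                    node = parent[node]
                return path[::-1]
            for neighbor in sorted(self.adj[node]):  # sorted for deterministic output
                if neighbor not in parent:
                    parent[neighbor] = node
                    queue.append(neighbor)
        return None

    def god_nodes(self, min_degree=3):
        """Assumed definition: nodes with at least min_degree neighbors, highest first."""
        hubs = [(n, len(self.adj[n])) for n in self.node_type
                if len(self.adj[n]) >= min_degree]
        return sorted(hubs, key=lambda item: (-item[1], item[0]))


graph = KnowledgeGraph()
graph.add_edge("auth-throttling.md", "runbook", "AuthRequests", "table")
graph.add_edge("auth-throttling.md", "runbook", "auth-service", "service")
graph.add_edge("payment-timeouts.md", "runbook", "AuthRequests", "table")
graph.add_edge("payment-timeouts.md", "runbook", "payments-api", "service")
graph.add_edge("login-failures.md", "runbook", "AuthRequests", "table")

cascade = graph.shortest_path("auth-throttling.md", "payment-timeouts.md")
hotspots = graph.god_nodes(min_degree=3)

print(cascade)   # ['auth-throttling.md', 'AuthRequests', 'payment-timeouts.md']
print(hotspots)  # [('AuthRequests', 3)]
```

The path itself is the explanation: the two runbooks are connected through the shared table, and the same table shows up as the hotspot because three guides query it.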
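Hybrid retrieval can be illustrated the same way. The sketch below scores runbooks with a textbook BM25 formula, takes the top lexical hits as seeds, and then pulls in any runbook that shares a telemetry table with a seed. The fusion rule (sibling score = its own BM25 score plus a fixed fraction of the seed's score), the parameter values, and every name in the code are assumptions made for illustration; the engine's real scoring may differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Textbook BM25 over a {name: text} mapping."""
    tokens = {name: tokenize(body) for name, body in docs.items()}
    avg_len = sum(len(t) for t in tokens.values()) / len(tokens)
    doc_freq = Counter(term for t in tokens.values() for term in set(t))
    query_terms = set(tokenize(query))
    scores = {}
    for name, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (len(tokens) - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    return scores


def hybrid_search(query, docs, tables_by_doc, top_k=1, boost=0.5):
    """Lexical seeds first, then siblings that share a telemetry table with a seed."""
    lexical = bm25_scores(query, docs)
    matched = [name for name in lexical if lexical[name] > 0]
    seeds = sorted(matched, key=lambda name: (-lexical[name], name))[:top_k]
    results = {name: {"score": lexical[name], "why": "lexical match"} for name in seeds}
    for seed in seeds:
        for other, tables in tables_by_doc.items():
            shared = tables_by_doc.get(seed, set()) & tables
            if other not in results and shared:
                results[other] = {
                    "score": lexical.get(other, 0.0) + boost * lexical[seed],
                    "why": f"shares table {sorted(shared)[0]} with {seed}",
                }
    return results


# Made-up runbook snippets and table links.
docs = {
    "payment-timeouts.md": "Checkout requests time out when the payments API is slow",
    "auth-throttling.md": "Token service returns 429 under load; raise the rate limit",
    "disk-full.md": "Free disk space on the log volume",
}
tables_by_doc = {
    "payment-timeouts.md": {"AuthRequests"},
    "auth-throttling.md": {"AuthRequests"},
    "disk-full.md": {"NodeMetrics"},
}

results = hybrid_search("payments time out", docs, tables_by_doc)
for name, hit in results.items():
    print(name, "->", hit["why"])
```

In this toy corpus the auth-throttling guide shares no words with the query, so BM25 alone scores it zero; it is returned only because it queries the same table as the top lexical hit, and the `why` field records that reason.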
Design constraints
| Constraint | Why |
|---|---|
| Standard library only | Runs in locked-down environments — no pip |
| No external services | No vector DB, graph DB, cloud, or LLM API |
| No data egress | Runbooks never leave the machine |
| ~600 lines, 6 modules | Easy to read, fork, and adapt to any team |
Stack
Python (standard library) · BM25 · in-process knowledge graph · KQL parsing · SQLite (optional) · MCP (optional) · MIT-licensed.