Runbook GraphRAG

The problem

During an incident, responders burn time hunting across dozens of runbooks and troubleshooting guides, and the ones that matter are often linked in ways plain keyword search can't see. An auth-throttling event can show up as payment timeouts; two guides can be related because they query the same telemetry table, not because they share words. Flat similarity search misses exactly those cascade paths.

What I built

A small, hackable graph-RAG engine that turns a folder of runbooks into a queryable knowledge graph. Given an incident description, it returns the most relevant guides plus the graph-linked siblings flat search would miss — along with the exact telemetry queries to run and the documented mitigations, each attributed to its source runbook.

Reasoning is left to your own coding assistant; the engine itself only does fast, explainable, local retrieval and graph traversal. It never calls an LLM and never touches the network.

How it works

Entity-aware ingestion — parses runbooks and extracts the nouns that connect them: services, symptoms, mitigations, and telemetry queries (KQL table/cluster/function extraction).
Typed knowledge graph — runbooks ↔ tables ↔ services ↔ symptoms, with shortest-path traversal (to reason about cascades) and "god-node" detection (recurring failure hotspots).
Hybrid retrieval — BM25 lexical scoring fused with graph expansion, so siblings sharing a telemetry table get surfaced and explained.
Assistant-ready — a CLI (query / explain / path / godnodes) and an optional MCP server expose the graph as tools.
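
To make the pipeline above concrete, here is a minimal standard-library sketch of the same idea: extract table names from KQL, link runbooks to the tables they query, score with BM25, then expand hits through shared tables. It is an illustration under stated assumptions, not the engine's actual code. The runbook data, the regex-based `kql_tables` heuristic, and every function name below are hypothetical, and only the runbook ↔ table edge type is modelled.

```python
import math
import re
from collections import Counter, defaultdict, deque

# Hypothetical runbooks (name -> (prose, KQL query)); not taken from the project.
RUNBOOKS = {
    "auth-throttling": (
        "Auth service throttling causes login failures",
        "AuthRequests | where ResultCode == 429 | summarize count() by bin(Timestamp, 5m)",
    ),
    "payment-timeouts": (
        "Payment API timeouts and checkout latency",
        "AuthRequests | join PaymentCalls on CorrelationId | where DurationMs > 5000",
    ),
    "disk-pressure": (
        "Node disk pressure and pod eviction",
        "NodeMetrics | where DiskUsedPct > 90",
    ),
}

def kql_tables(query):
    """Naive heuristic: the leading identifier plus any identifier after `join`."""
    tables = set(re.findall(r"^\s*([A-Za-z_]\w*)", query))
    tables |= set(re.findall(r"\bjoin\s+(?:kind\s*=\s*\w+\s+)?\(?\s*([A-Za-z_]\w*)", query))
    return tables

def build_graph(runbooks):
    """Undirected graph with typed node ids: 'runbook:<name>' and 'table:<name>'."""
    graph = defaultdict(set)
    for name, (_, query) in runbooks.items():
        for table in kql_tables(query):
            graph[f"runbook:{name}"].add(f"table:{table}")
            graph[f"table:{table}"].add(f"runbook:{name}")
    return graph

def shortest_path(graph, start, goal):
    """Breadth-first search; returns the node list, or None if unreachable."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(graph.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 with the non-negative (Lucene-style) IDF variant."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    avg_len = sum(len(t) for t in tokens.values()) / len(tokens)
    df = Counter(term for t in tokens.values() for term in set(t))
    scores = {}
    for name, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (len(tokens) - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    return scores

def retrieve(query, runbooks, graph):
    """Lexical hits first, then siblings reached via a shared table, with a reason."""
    scores = bm25_scores(query, {n: text for n, (text, _) in runbooks.items()})
    hits = [n for n, s in sorted(scores.items(), key=lambda kv: -kv[1]) if s > 0]
    siblings = {}
    for hit in hits:
        for table in sorted(graph.get(f"runbook:{hit}", ())):
            for other in sorted(graph.get(table, ())):
                name = other.split(":", 1)[1]
                if name not in hits and name not in siblings:
                    siblings[name] = f"shares {table.split(':', 1)[1]} with {hit}"
    return hits, siblings

graph = build_graph(RUNBOOKS)
hits, siblings = retrieve("payment timeouts at checkout", RUNBOOKS, graph)
print(hits)      # ['payment-timeouts']
print(siblings)  # {'auth-throttling': 'shares AuthRequests with payment-timeouts'}
print(shortest_path(graph, "runbook:payment-timeouts", "runbook:auth-throttling"))
```

In this toy corpus the auth-throttling runbook shares no words with the query, so BM25 alone scores it zero; it is surfaced only because both runbooks query the same table, and the path printed on the last line is the explanation for that link.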

Design constraints

Constraint	Why
Standard library only	Runs in locked-down environments with no `pip install` step
No external services	No vector DB, graph DB, cloud, or LLM API
No data egress	Runbooks never leave the machine
~600 lines, 6 modules	Easy to read, fork, and adapt to any team

Stack

Python (standard library) · BM25 · in-process knowledge graph · KQL parsing · SQLite (optional) · MCP (optional) · MIT-licensed.