The problem
In large enterprise infrastructure environments, when a customer hits a routing, BGP or WAN issue, engineers have to pivot manually across many database clusters, disconnected telemetry systems, runbooks and dashboards. They often spend 30 to 90 minutes correlating signals across these systems before they can even form a hypothesis.
What I built
A full-stack, AI-native operational platform that brings network diagnostics, telemetry intelligence, incident management, analytics and AI troubleshooting into a single workspace.
I architected and led engineering on the platform end-to-end:
- Autonomous diagnostic agent (GPT-5 + Claude Opus) — understands natural-language troubleshooting requests, selects the right database clusters, chains diagnostic workflows in sequence, correlates telemetry across systems and produces structured TSG-grade summaries.
- MCP tooling layer — 200+ diagnostic tools organized as a structured, AI-callable hierarchy spanning every operational domain.
- Operational dashboards — incident analytics, risk assessment, topology intelligence, incident-quality metrics, SLA monitoring and executive summary views.
- Conversational ops bot — multi-turn conversations, adaptive cards, change-request automation, HTML-rendered operational responses.
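
To make the tool hierarchy and workflow chaining described above concrete, here is a minimal sketch, not the platform's actual code. The `Tool` and `ToolRegistry` classes, the dotted tool names, the cluster names and the stubbed results are all hypothetical, chosen only to illustrate how an agent could pick every tool under a domain, run them in sequence, and return a structured summary that records which clusters were touched.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Tool:
    """One AI-callable diagnostic tool (all names here are hypothetical)."""
    name: str                    # dotted path, e.g. "routing.bgp.session_state"
    cluster: str                 # telemetry cluster the tool queries
    run: Callable[[str], dict]   # takes a device id, returns a structured finding


@dataclass
class ToolRegistry:
    tools: Dict[str, Tool] = field(default_factory=dict)

    def register(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def in_domain(self, prefix: str) -> List[Tool]:
        """Return every tool under a domain prefix, sorted by name."""
        return sorted(
            (t for t in self.tools.values() if t.name.startswith(prefix + ".")),
            key=lambda t: t.name,
        )


def run_workflow(registry: ToolRegistry, domain: str, device: str) -> dict:
    """Run a domain's tools in sequence and collect findings per cluster."""
    findings = []
    for tool in registry.in_domain(domain):
        result = tool.run(device)
        findings.append({"tool": tool.name, "cluster": tool.cluster, **result})
    return {
        "device": device,
        "clusters": sorted({f["cluster"] for f in findings}),
        "findings": findings,
    }


# Stubbed tools standing in for real cluster queries (assumed for illustration).
registry = ToolRegistry()
registry.register(Tool("routing.bgp.session_state", "bgp-telemetry",
                       lambda d: {"status": "down"}))
registry.register(Tool("routing.bgp.prefix_count", "bgp-telemetry",
                       lambda d: {"prefixes": 0}))
registry.register(Tool("wan.link.utilization", "wan-metrics",
                       lambda d: {"utilization_pct": 41}))

summary = run_workflow(registry, "routing.bgp", "edge-router-01")
```

In this sketch the dotted prefix doubles as the routing key: asking for `routing.bgp` selects only the BGP tools and therefore only the clusters they depend on. A real agent would choose the domain from the natural-language request rather than receive it as an argument.
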
Technical highlights
| Metric | Value |
|---|---|
| Production code | 100K+ lines |
| Database / telemetry clusters | 19 |
| Validated diagnostic tools | 180+ |
| MCP tools surfaced to agents | 200+ |
| React components | 130+ |
| Operational APIs | 70+ |
| Multi-tier Redis caching | yes |
| Parallel cross-cluster querying | yes |
| OpenTelemetry tracing | yes |
| Multi-tenant SSO / OIDC auth | yes |
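
The caching and parallel-query rows in the table above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the production design: the second cache tier is a plain dict standing in for Redis so the example runs without a server, and `query_cluster`, the cluster names and the device id are hypothetical.

```python
import asyncio
from typing import Dict, List

# Tier 1: per-process memory. Tier 2: a dict standing in for a shared Redis
# cache so the sketch runs without a server (an assumption for illustration).
local_cache: Dict[str, dict] = {}
shared_cache: Dict[str, dict] = {}
backend_calls: List[str] = []   # records which clusters were actually queried


async def query_cluster(cluster: str, device: str) -> dict:
    """Hypothetical cluster query; the sleep simulates network latency."""
    backend_calls.append(cluster)
    await asyncio.sleep(0.01)
    return {"cluster": cluster, "device": device, "rows": 1}


async def cached_query(cluster: str, device: str) -> dict:
    key = f"{cluster}:{device}"
    if key in local_cache:                  # tier 1 hit
        return local_cache[key]
    if key in shared_cache:                 # tier 2 hit, promote to tier 1
        local_cache[key] = shared_cache[key]
        return local_cache[key]
    result = await query_cluster(cluster, device)   # miss: go to the cluster
    shared_cache[key] = result
    local_cache[key] = result
    return result


async def fan_out(clusters: List[str], device: str) -> List[dict]:
    """Query all clusters concurrently; results keep the input order."""
    return await asyncio.gather(*(cached_query(c, device) for c in clusters))


clusters = ["bgp-telemetry", "wan-metrics", "incident-db"]
first = asyncio.run(fan_out(clusters, "edge-router-01"))
second = asyncio.run(fan_out(clusters, "edge-router-01"))  # served from cache
```

Because the queries run concurrently, the wall-clock cost of the first call is close to the slowest single cluster rather than the sum of all of them. The second call never reaches a cluster at all. Cache expiry and invalidation are left out of the sketch.
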
Stack
Python · FastAPI · React 18 · TypeScript · GPT-5 · Claude Opus · MCP · Database clusters · Redis · Serverless Functions · Cloud Identity · Docker · OpenTelemetry.
Business impact
- Troubleshooting time: 30–90 min → 1–3 min.
- 10–30× faster incident diagnostics.
- Thousands of engineering hours saved annually.
- Standardized troubleshooting methodology across the org.
- Institutional operational knowledge encoded into AI workflows.
- Significantly faster onboarding for junior engineers.
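
As a quick check on the speedup range quoted above, the arithmetic can be worked from the before and after times in the same list. The choice of which endpoints to pair is an assumption of this sketch, not something the figures themselves specify.

```python
before_min = (30, 90)   # minutes to diagnose before, from the list above
after_min = (1, 3)      # minutes to diagnose after, from the list above

# Pair both "before" times with the slowest "after" time (3 min).
conservative = tuple(b / after_min[1] for b in before_min)

# Pair both "before" times with the fastest "after" time (1 min).
optimistic = tuple(b / after_min[0] for b in before_min)
```

Measured against the slowest after-time of 3 minutes, the speedup runs from 10× to 30×, which matches the quoted range. Measured against the fastest after-time of 1 minute, it would run from 30× to 90×.
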
Engineering philosophy
The platform succeeds only when the agents succeed, and the agents succeed because of what they are given: well-shaped MCP tools, clean cluster routing and encoded runbooks. Most enterprise AI bets fail because the tools given to the agents are weak, not because the models are.