Network Investigation Agentic Platform

The problem

In large enterprise infrastructure environments, when a customer hits a routing, BGP or WAN issue, engineers have to manually pivot across many database clusters, disconnected telemetry systems, runbooks and dashboards. They often spend 30 to 90 minutes correlating signals across those systems before they can even form a hypothesis.

What I built

A full-stack, AI-native operational platform that brings network diagnostics, telemetry intelligence, incident management, analytics and AI troubleshooting into a single workspace.

I architected and led engineering on the platform end-to-end:

Autonomous diagnostic agent (GPT-5 + Claude Opus) — understands natural-language troubleshooting requests, selects the right database clusters, chains diagnostic workflows in sequence, correlates telemetry across systems and produces structured TSG-grade summaries.
MCP tooling layer — 200+ diagnostic tools organized as a structured, AI-callable hierarchy spanning every operational domain.
Operational dashboards — incident analytics, risk assessment, topology intelligence, incident-quality metrics, SLA monitoring and executive summary views.
Conversational ops bot — multi-turn conversations, adaptive cards, change-request automation, HTML-rendered operational responses.
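
To make the "structured, AI-callable hierarchy" and "chains diagnostic workflows in sequence" ideas concrete, here is a minimal sketch in plain Python. It is not the platform's actual code and does not use any MCP SDK: the ToolRegistry class, the dotted tool names, the cluster labels and the stub tool bodies are all hypothetical, chosen only to illustrate how tools might be grouped by domain, tagged with the cluster they query, and run in order with each step seeing earlier findings.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Tool:
    name: str                      # dotted path, e.g. "routing.bgp.session_state" (hypothetical)
    cluster: str                   # label of the cluster this tool queries (hypothetical)
    description: str               # text an agent could read when choosing a tool
    run: Callable[[dict], dict]    # takes the current context, returns new findings


class ToolRegistry:
    """Hypothetical registry: tools are addressed by dotted domain paths."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        if tool.name in self._tools:
            raise ValueError(f"duplicate tool name: {tool.name}")
        self._tools[tool.name] = tool

    def list_domain(self, prefix: str) -> List[str]:
        """Return the sorted names of all tools under a domain prefix."""
        return sorted(
            n for n in self._tools if n == prefix or n.startswith(prefix + ".")
        )

    def run_chain(self, names: List[str], params: dict) -> List[dict]:
        """Run tools in order; each step sees the findings of earlier steps."""
        context = dict(params)
        steps = []
        for name in names:
            tool = self._tools[name]
            output = tool.run(context)
            steps.append({"tool": name, "cluster": tool.cluster, "output": output})
            context.update(output)
        return steps


# Stub tools with canned answers, standing in for real cluster queries.
registry = ToolRegistry()
registry.register(Tool(
    "routing.bgp.session_state", "telemetry-a", "BGP session state for a device",
    lambda ctx: {"session_state": "Idle"},
))
registry.register(Tool(
    "routing.bgp.prefix_count", "telemetry-b", "Prefixes received from the peer",
    lambda ctx: {"prefix_count": 120 if ctx["session_state"] == "Established" else 0},
))
registry.register(Tool(
    "wan.link.utilization", "telemetry-a", "Link utilization for a device",
    lambda ctx: {"utilization_pct": 42},
))

steps = registry.run_chain(
    ["routing.bgp.session_state", "routing.bgp.prefix_count"],
    {"device": "edge-01"},
)
```

In this sketch the second tool reads the session state produced by the first, which is the simplest form of chaining. A real agent would choose the tool names itself from the descriptions rather than receive a fixed list.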

Technical highlights

Metric	Value
Production code	100K+ lines
Database / telemetry clusters	19
Validated diagnostic tools	180+
MCP tools surfaced to agents	200+
React components	130+
Operational APIs	70+
Multi-tier Redis caching	yes
Parallel cross-cluster querying	yes
OpenTelemetry tracing	yes
Multi-tenant SSO / OIDC auth	yes
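
The caching and parallel-querying rows above can be illustrated with a short sketch. It assumes an asyncio-based fan-out, which the write-up does not confirm, and it replaces Redis with a plain dictionary so the example stays self-contained. The names TwoTierCache, query_clusters and fake_query, the cluster labels and the query string are all hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional

QueryFn = Callable[[str, str], Awaitable[dict]]


class TwoTierCache:
    """In-process dict in front of a shared tier (a dict here, standing in for Redis)."""

    def __init__(self) -> None:
        self.local: Dict[str, dict] = {}
        self.shared: Dict[str, dict] = {}

    def get(self, key: str) -> Optional[dict]:
        if key in self.local:
            return self.local[key]
        if key in self.shared:
            # Promote a shared-tier hit into the local tier.
            self.local[key] = self.shared[key]
            return self.local[key]
        return None

    def put(self, key: str, value: dict) -> None:
        self.local[key] = value
        self.shared[key] = value


async def query_clusters(
    clusters: List[str], query: str, run_query: QueryFn, cache: TwoTierCache
) -> Dict[str, object]:
    """Query every cluster concurrently, consulting the cache first."""

    async def one(cluster: str) -> dict:
        key = f"{cluster}:{query}"
        cached = cache.get(key)
        if cached is not None:
            return cached
        result = await run_query(cluster, query)
        cache.put(key, result)
        return result

    # return_exceptions=True keeps one failing cluster from cancelling the rest;
    # a failed cluster's entry is then the exception object itself.
    results = await asyncio.gather(*(one(c) for c in clusters), return_exceptions=True)
    return dict(zip(clusters, results))


calls: List[str] = []  # records which clusters were actually queried


async def fake_query(cluster: str, query: str) -> dict:
    """Stub for a real cluster client; sleeps briefly to simulate latency."""
    calls.append(cluster)
    await asyncio.sleep(0.01)
    return {"cluster": cluster, "rows": 1}


cache = TwoTierCache()
clusters = ["cluster-1", "cluster-2", "cluster-3"]
first = asyncio.run(query_clusters(clusters, "bgp_flaps", fake_query, cache))
second = asyncio.run(query_clusters(clusters, "bgp_flaps", fake_query, cache))
```

The second call is served entirely from the cache, so fake_query runs only three times in total. A production version would also need cache expiry and per-tenant key scoping, which this sketch leaves out.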

Stack

Python · FastAPI · React 18 · TypeScript · GPT-5 · Claude Opus · MCP · Database clusters · Redis · Serverless Functions · Cloud Identity · Docker · OpenTelemetry.

Business impact

Troubleshooting time: 30–90 min → 1–3 min.
10–30× faster incident diagnostics.
Thousands of engineering hours saved annually.
Standardized troubleshooting methodology across the org.
Institutional operational knowledge encoded into AI workflows.
Significantly faster onboarding for junior engineers.

Engineering philosophy

The platform succeeds because its agents are set up to succeed: well-shaped MCP tools, clean cluster routing and encoded runbooks. Most enterprise AI bets fail because the tools given to the agents are weak, not because the models are.