AI-Powered Observability Platform
Evaluated and prototyped an AI-powered observability platform to improve troubleshooting across distributed infrastructure and application environments.
Focused on helping developers move from fragmented VM-level investigation to centralized observability, fast log querying, detailed log and device configuration insights, trace-to-log correlation, anomaly detection, and pattern recognition.


The Problem: Isolated Logs & Limited Observability
Isolated Logs:
- NSO logs (device trace, Java, Python, and northbound logs) are stored only on individual VMs. If a VM crashes, its logs are lost, making recovery and analysis difficult.
- No centralized platform to store and access logs from across VMs.
Limited Observability:
- The existing observability platform had limited visibility into detailed logs and device configuration data, making root-cause analysis and troubleshooting slower.
- No metrics were derived from detailed NSO logs, which hindered performance improvement (e.g., exception monitoring, device performance tracking).
What I Built
Built a centralized observability prototype to solve two key problems: isolated VM-level logs and limited visibility into NSO logs, device configurations, metrics, and troubleshooting signals.
Solution Flow:
NSO Logs & Java Traces → OpenTelemetry Collector → Metadata Enrichment → Centralized Observability Platform → Log Parsing & Attribute Extraction → Log Views → ClickHouse SQL Metrics → Exception Alerts & Anomaly Detection
The platform centralized logs from distributed VMs and captured detailed log attributes that made troubleshooting faster and more effective. It enabled fast log querying, trace-to-log correlation, structured log views, and metrics creation for exception monitoring, device-level performance analysis, and proactive troubleshooting.
Observability Workflows Implemented
1. Cisco NSO Workflow and Log Analysis
Explored NSO workflows, YANG models, NETCONF/RESTCONF APIs, and log structures to identify observability gaps and useful troubleshooting signals.
2. Observability Requirements Definition
Defined current and long-term requirements for centralized logs, metrics, traces, device configuration insights, detailed log analysis, AI-assisted insights, and faster root-cause analysis.
3. Observability Platform Research
Compared observability platforms across usability, OpenTelemetry support, integration effort, querying, alerting, scalability, and cost.
4. Linux-Based OpenTelemetry Setup
Set up the observability pipeline on a Linux-based QA VM using the OpenTelemetry Collector, including service configuration, port validation, and telemetry export.
5. NSO Log Integration
Integrated NSO logs into the observability pipeline using OpenTelemetry filelog receivers and routed them into a centralized observability platform.
6. Attribute Extraction and Log Views
Parsed raw logs, extracted useful attributes, and created structured log views to support faster querying, filtering, and troubleshooting.
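As a rough illustration of the attribute-extraction idea, the sketch below parses one log line into named fields with a regular expression. The log layout, the field names (device, rpc, duration_ms), and the sample line are all hypothetical and do not reflect the real NSO log format or the parsing rules used in the prototype.

```python
import re

# Hypothetical log layout (NOT the actual NSO format):
# <timestamp> <LEVEL> device=<name> rpc=<operation> duration_ms=<int> msg="<text>"
LINE_RE = re.compile(
    r'(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+'
    r'device=(?P<device>\S+)\s+rpc=(?P<rpc>\S+)\s+'
    r'duration_ms=(?P<duration_ms>\d+)\s+msg="(?P<message>[^"]*)"'
)


def parse_line(line):
    """Return a dict of extracted attributes, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    attrs = match.groupdict()
    # Store the latency as a number so it can be aggregated later.
    attrs["duration_ms"] = int(attrs["duration_ms"])
    return attrs


# Illustrative sample line, invented for this sketch.
sample = (
    '2024-05-01T12:00:00Z ERROR device=ce0 rpc=edit-config '
    'duration_ms=412 msg="connection timed out"'
)
parsed = parse_line(sample)
print(parsed)
```

Once each line is reduced to a flat set of attributes like this, filtering by device, level, or operation becomes a simple field lookup rather than a full-text search.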
7. Metrics Using ClickHouse SQL
Created metrics from parsed log data, including device-level RPC error rate and device request latency, to monitor reliability and performance.
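The prototype computed these metrics with queries in the observability platform. The Python sketch below only illustrates the underlying aggregation logic, assuming parsed records with hypothetical device, level, and duration_ms fields, and assuming that a record at ERROR level counts as a failed RPC.

```python
from collections import defaultdict


def device_metrics(records):
    """Aggregate per-device RPC error rate and mean latency from parsed log records."""
    totals = defaultdict(lambda: {"requests": 0, "errors": 0, "latency_sum": 0})
    for rec in records:
        stats = totals[rec["device"]]
        stats["requests"] += 1
        stats["latency_sum"] += rec["duration_ms"]
        # Assumption: an ERROR-level record represents a failed RPC.
        if rec["level"] == "ERROR":
            stats["errors"] += 1
    return {
        device: {
            "rpc_error_rate": s["errors"] / s["requests"],
            "mean_latency_ms": s["latency_sum"] / s["requests"],
        }
        for device, s in totals.items()
    }


# Illustrative records, invented for this sketch.
records = [
    {"device": "ce0", "level": "INFO", "duration_ms": 100},
    {"device": "ce0", "level": "ERROR", "duration_ms": 300},
    {"device": "ce1", "level": "INFO", "duration_ms": 50},
]
metrics = device_metrics(records)
print(metrics)
```

In the example data, ce0 has one failure out of two requests, so its error rate is 0.5 and its mean latency is 200 ms.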
8. Exception Alerts and Anomaly Detection
Configured exception alerts and anomaly detection workflows to evaluate proactive issue detection and operational feasibility.
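The anomaly detection method used by the platform is not described here. As a stand-in, the sketch below shows one simple approach, a z-score check on exception counts per time interval. The function name, the threshold of 3.0, and the sample counts are assumptions made for illustration.

```python
from statistics import mean, pstdev


def is_anomalous(history, current, threshold=3.0):
    """Flag `current` if it is more than `threshold` standard deviations above the mean of `history`."""
    if len(history) < 2:
        # Too little data to estimate a baseline.
        return False
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # Perfectly flat baseline: any increase counts as anomalous.
        return current > mu
    return (current - mu) / sigma > threshold


# Illustrative exception counts per interval, invented for this sketch.
baseline = [2, 3, 2, 4, 3, 2, 3]
spike_flagged = is_anomalous(baseline, 15)
normal_flagged = is_anomalous(baseline, 3)
print(spike_flagged, normal_flagged)
```

A check like this only catches upward spikes against a stable baseline. Seasonal traffic or slow drifts would need a more robust method.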

Impact & Metrics
Cost Impact
Projected observability cost savings: ~46% with SigNoz Cloud; up to 90% with self-hosted open-source deployment.
ROI Assessment
Estimated high ROI (up to 9× vs. other observability platforms) based on ingestion volume, retention, and infrastructure overhead modeling.
Operational Efficiency
Centralized access to all logs, unified with metrics and traces, enabled faster root-cause analysis, cutting time-to-resolve by over 50% and reducing reliance on reactive, customer-reported incidents.
Developer Productivity
Centralized querying and trace-to-log correlation significantly improved troubleshooting efficiency and reduced time spent context-switching across tools.
System Reliability
Improved visibility into exceptions and performance trends helped lower the risk of blind spots and data loss in distributed systems.
Key Learnings
Centralized Log Retention
Centralized logs outside individual VMs to reduce data-loss risk during VM failures and make logs easier to access when needed.
OpenTelemetry Pipeline Setup
Gained hands-on experience configuring OpenTelemetry receivers, processors, exporters, and telemetry pipelines in a Linux-based environment.
Logs to Metrics
Learned how structured log attributes can be converted into metrics for error monitoring, latency tracking, and reliability analysis.
Root-Cause Analysis
Understood how fast querying, detailed log views, and trace-to-log correlation improve developer troubleshooting efficiency.
Proactive Issue Detection
Explored how alerts, anomaly detection, and pattern recognition can help identify issues earlier and reduce customer impact.