AI-Powered Observability Platform

Evaluated and prototyped an AI-powered observability platform to improve troubleshooting across distributed infrastructure and application environments.

Focused on helping developers move from fragmented VM-level investigation to centralized observability with fast log querying, detailed log and device configuration insights, trace-to-log correlation, anomaly detection, and pattern recognition.


The Problem: Isolated Logs & Limited Observability

Isolated Logs:

NSO logs (device trace, Java, Python, and northbound logs) were stored only on individual VMs. If a VM crashed, its logs were lost, making recovery and analysis difficult.

No centralized platform to access and store logs from across VMs.

Limited Observability:

The existing observability platform had limited visibility into detailed logs and device configuration data, making root-cause analysis and troubleshooting slower.

No metrics were derived from detailed NSO logs, which made it hard to track and improve performance (e.g., exception monitoring, device performance).

What I Built

Built a centralized observability prototype to solve two key problems: isolated VM-level logs and limited visibility into NSO logs, device configurations, metrics, and troubleshooting signals.

Solution Flow:
NSO Logs & Java Traces → OpenTelemetry Collector → Metadata Enrichment → Centralized Observability Platform → Log Parsing & Attribute Extraction → Log Views → ClickHouseQL Metrics → Exception Alerts & Anomaly Detection

The platform centralized logs from distributed VMs and captured detailed log attributes that made troubleshooting faster and more effective. It enabled fast log querying, trace-to-log correlation, structured log views, and metrics creation for exception monitoring, device-level performance analysis, and proactive troubleshooting.
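As a rough illustration of trace-to-log correlation, the sketch below attaches log records to the span that shares their trace and span identifiers. The field names (trace_id, span_id, logs) and the dict-based records are simplifying assumptions for illustration; in the prototype this correlation was handled by the observability platform itself, not by custom code.

```python
from collections import defaultdict


def correlate_logs_with_spans(spans, logs):
    """Attach each log record to the span sharing its trace_id and span_id.

    `spans` and `logs` are lists of dicts with hypothetical keys
    'trace_id' and 'span_id'; logs without those keys stay unattached.
    """
    logs_by_span = defaultdict(list)
    for log in logs:
        key = (log.get("trace_id"), log.get("span_id"))
        logs_by_span[key].append(log)

    # Return copies of the spans with a 'logs' list added to each one.
    return [
        {**span, "logs": logs_by_span.get((span["trace_id"], span["span_id"]), [])}
        for span in spans
    ]
```

With the logs grouped under their span, a developer can jump from a slow or failed span straight to the log lines emitted while it was running.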

Observability Workflows Implemented

1. Cisco NSO Workflow and Log Analysis

Explored NSO workflows, YANG models, NETCONF/RESTCONF APIs, and log structures to identify observability gaps and useful troubleshooting signals.

2. Observability Requirements Definition

Defined current and long-term requirements for centralized logs, metrics, traces, device configuration insights, detailed log analysis, AI-assisted insights, and faster root-cause analysis.

3. Observability Platform Research

Compared observability platforms across usability, OpenTelemetry support, integration effort, querying, alerting, scalability, and cost.

4. Linux-Based OpenTelemetry Setup

Set up the observability pipeline on a Linux-based QA VM using OpenTelemetry Collector, service configuration, port validation, and telemetry export.
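The port validation step can be sketched in Python as a simple TCP reachability check. This is a minimal sketch, assuming the collector listens on the default OTLP ports (4317 for gRPC, 4318 for HTTP) on the local host; the function names and the host value are illustrative and not part of the original setup.

```python
import socket

# Default OTLP ports: 4317 (gRPC) and 4318 (HTTP).
OTLP_PORTS = {"otlp-grpc": 4317, "otlp-http": 4318}


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_collector_ports(host: str = "127.0.0.1") -> dict:
    """Map each OTLP port name to whether it accepts TCP connections."""
    return {name: port_is_open(host, port) for name, port in OTLP_PORTS.items()}
```

Calling check_collector_ports() after starting the collector service gives a quick signal of whether the receivers are listening before any telemetry is exported.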

5. NSO Log Integration

Integrated NSO logs into the observability pipeline using OpenTelemetry filelog receivers and routed them into a centralized observability platform.
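The shape of such a logs pipeline can be sketched as follows. The Python below builds a collector configuration as a dict and prints it as JSON, which YAML parsers generally accept. The log path, endpoint, and VM name are placeholders, and the component names (filelog, resource, batch, otlp) follow the OpenTelemetry Collector documentation as I understand it; this is not the configuration used in the prototype.

```python
import json

# Hypothetical values: adjust the paths and endpoint for the real environment.
NSO_LOG_GLOBS = ["/var/log/ncs/*.log"]           # assumed NSO log directory
BACKEND_ENDPOINT = "observability.example:4317"  # placeholder OTLP endpoint


def build_collector_config(log_globs, endpoint, vm_name):
    """Build a logs pipeline: filelog receiver -> resource + batch -> OTLP exporter."""
    return {
        "receivers": {
            # Tail the matching files, starting from new lines only.
            "filelog": {"include": list(log_globs), "start_at": "end"},
        },
        "processors": {
            # Metadata enrichment: tag every record with the source VM.
            "resource": {
                "attributes": [
                    {"key": "host.name", "value": vm_name, "action": "upsert"},
                ]
            },
            "batch": {},
        },
        "exporters": {"otlp": {"endpoint": endpoint}},
        "service": {
            "pipelines": {
                "logs": {
                    "receivers": ["filelog"],
                    "processors": ["resource", "batch"],
                    "exporters": ["otlp"],
                }
            }
        },
    }


config = build_collector_config(NSO_LOG_GLOBS, BACKEND_ENDPOINT, "qa-vm-01")
print(json.dumps(config, indent=2))
```

Tagging records with the source VM at the collector is what lets logs from many VMs be told apart once they land in one central store.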

6. Attribute Extraction and Log Views

Parsed raw logs, extracted useful attributes, and created structured log views to support faster querying, filtering, and troubleshooting.
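To illustrate the idea of attribute extraction, here is a minimal Python sketch that turns a raw log line into a structured record. The line layout and the key=value pairs in the message are assumptions made for this example, not the actual NSO log formats, and in the prototype the parsing was done inside the observability pipeline and platform.

```python
import re

# Assumed line layout (hypothetical, not taken from real NSO logs):
#   <LEVEL> 16-Jan-2026::10:55:14.123 component: - free-text message
LINE_RE = re.compile(
    r"^<(?P<level>[A-Z]+)>\s+"
    r"(?P<timestamp>\d{1,2}-[A-Za-z]{3}-\d{4}::\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<component>\S+):\s+-\s+"
    r"(?P<message>.*)$"
)
# Assumed key=value pairs inside the message, e.g. device=ce0 latency_ms=42
KV_RE = re.compile(r"(?P<key>[A-Za-z_]+)=(?P<value>\S+)")


def parse_line(line):
    """Return a dict of extracted attributes, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # Promote key=value pairs to top-level attributes without
    # overwriting the fields captured from the line prefix.
    for kv in KV_RE.finditer(record["message"]):
        record.setdefault(kv.group("key"), kv.group("value"))
    return record
```

Once fields such as level, component, and device are separate attributes, log views can filter and group on them instead of relying on free-text search.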

7. Metrics Using ClickHouseQL

Created metrics from parsed log data, including device-level RPC error rate and device request latency, to monitor reliability and performance.
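The aggregation behind these two metrics can be sketched in plain Python. This is a stand-in for the ClickHouseQL queries used in the prototype, not a reproduction of them, and the record keys (device, level, latency_ms) are hypothetical attribute names.

```python
from collections import defaultdict
from statistics import mean


def device_metrics(records):
    """Aggregate per-device RPC error rate and mean latency from parsed records.

    Each record is a dict with hypothetical keys:
    'device' (str), 'level' (str), and 'latency_ms' (number or numeric string).
    """
    grouped = defaultdict(list)
    for record in records:
        grouped[record["device"]].append(record)

    metrics = {}
    for device, rows in grouped.items():
        errors = sum(1 for row in rows if row["level"] == "ERROR")
        metrics[device] = {
            "requests": len(rows),
            "rpc_error_rate": errors / len(rows),
            "mean_latency_ms": mean(float(row["latency_ms"]) for row in rows),
        }
    return metrics
```

Grouping by device first is the key design choice: it surfaces a single misbehaving device that a fleet-wide average would hide.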

8. Exception Alerts and Anomaly Detection

Configured exception alerts and anomaly detection workflows to evaluate proactive issue detection and operational feasibility.
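One simple way to reason about anomaly detection on exception counts is a z-score check against a recent baseline. The sketch below is an assumed, simplified technique for illustration only; the platform's own anomaly detection was used in the prototype and may work differently. The function name and the threshold of 3.0 are illustrative choices.

```python
from statistics import mean, pstdev


def is_anomalous(history, current, threshold=3.0):
    """Flag `current` if it exceeds the mean of `history` by more than
    `threshold` population standard deviations."""
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        return current > baseline  # flat baseline: any increase stands out
    return (current - baseline) / spread > threshold
```

Here history would hold exception counts from earlier time windows and current the count for the latest window; only upward deviations are flagged, since a drop in exceptions is not an alerting concern.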

Impact & Metrics

Cost Impact

Projected observability cost savings: ~46% with SigNoz Cloud; up to 90% with self-hosted open-source deployment.

ROI Assessment

Estimated high ROI (up to 9× vs. other observability platforms) based on ingestion volume, retention, and infrastructure overhead modeling.

Operational Efficiency

Having all logs available centrally, unified with metrics and traces, enabled faster root-cause analysis, cutting time-to-resolve by over 50% and reducing reliance on reactive, customer-reported incidents.

Developer Productivity

Centralized querying and trace-to-log correlation significantly improved troubleshooting efficiency and reduced time spent context-switching across tools.

System Reliability

Improved visibility into exceptions and performance trends helped lower the risk of blind spots and data loss in distributed systems.

Key Learnings

Centralized Log Retention

Centralized logs outside individual VMs to reduce data-loss risk during VM failures and make logs easier to access when needed.

OpenTelemetry Pipeline Setup

Gained hands-on experience configuring OpenTelemetry receivers, processors, exporters, and telemetry pipelines in a Linux-based environment.

Logs to Metrics

Learned how structured log attributes can be converted into metrics for error monitoring, latency tracking, and reliability analysis.

Root-Cause Analysis

Understood how fast querying, detailed log views, and trace-to-log correlation improve developer troubleshooting efficiency.

Proactive Issue Detection

Explored how alerts, anomaly detection, and pattern recognition can help identify issues earlier and reduce customer impact.