AI-Powered Log & Observability Platform
Proposed and evaluated SigNoz as a cost-effective, AI-powered observability platform to unify logs, metrics, and traces in a single pane.
Focused on improving developer troubleshooting efficiency through fast querying, detailed insight into logs and device configurations, seamless trace-to-log correlation, anomaly detection, and pattern recognition.


The Problem: Isolated Logs & Limited Observability
Isolated Logs:
- NSO logs (device trace, Java, Python, and northbound logs) are stored only on individual VMs; if a VM crashes, its logs are lost, making recovery and analysis difficult.
- No centralized platform to collect and store logs from across VMs.
Limited Observability:
- The current observability setup restricts exports of logs and device configurations, preventing quick queries against logs, which slows down root-cause analysis and troubleshooting.
- No metrics derived from detailed NSO logs, preventing performance improvements (e.g., exception monitoring, device performance).

My Role
- Developed a deep understanding of Cisco NSO workflows and logging by exploring NSO flows, YANG models, NETCONF/RESTCONF APIs, and log structures to identify observability gaps.
- Defined current and long-term observability requirements, focusing on unified visibility across logs, metrics, and traces with AI-assisted insights.
- Researched and compared multiple observability platforms, evaluating usability, integration effort, performance, and cost trade-offs.
- Conducted a hands-on prototype setup of SigNoz: integrated NSO logs, extracted attributes for feature engineering, and configured SQL-based metrics and exception alerts to evaluate troubleshooting workflows and operational feasibility.
- Evaluated ingestion latency, query performance, memory usage, and estimated cost impact.
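The attribute-extraction step described above can be sketched as a small parser. This is a minimal illustration, assuming a hypothetical NSO log line layout (timestamp, level, device, message); real NSO device-trace, Java, and Python logs use different formats.

```python
import re
from collections import Counter

# Hypothetical log line layout for illustration only; actual NSO log
# formats differ and would need their own patterns.
LOG_PATTERN = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>INFO|WARN|ERROR)\s+"
    r"device=(?P<device>\S+)\s+"
    r"(?P<message>.*)$"
)

def extract_attributes(line: str):
    """Parse one log line into attributes suitable for indexing and metrics."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

def exception_counts(lines):
    """Count ERROR-level lines per device -- the basis for an exception alert."""
    counts = Counter()
    for line in lines:
        attrs = extract_attributes(line)
        if attrs and attrs["level"] == "ERROR":
            counts[attrs["device"]] += 1
    return counts

sample = [
    "2024-05-01 10:00:00 ERROR device=ios-0 connect timeout",
    "2024-05-01 10:00:01 INFO device=ios-1 sync-from ok",
    "2024-05-01 10:00:02 ERROR device=ios-0 commit failed",
]
print(exception_counts(sample))  # Counter({'ios-0': 2})
```

In the prototype itself, attributes like these were extracted at ingestion time so SigNoz could index them and drive SQL-based metrics and alerts, rather than being parsed ad hoc at query time.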
Impact & Metrics
Cost Impact
Projected observability cost savings: ~46% with SigNoz Cloud; up to 90% with a self-hosted open-source deployment.
ROI Assessment
Estimated high ROI (up to 9× vs. other observability platforms) based on ingestion volume, retention, and infrastructure overhead modeling.
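The cost and ROI modeling above can be sketched as a back-of-the-envelope calculation driven by ingestion volume and retention. The per-GB rates below are illustrative placeholders, not actual vendor pricing, and the numbers do not reproduce the figures reported in this project.

```python
def monthly_cost(gb_per_day: float, retention_days: int,
                 ingest_rate: float, storage_rate: float) -> float:
    """Estimate monthly observability spend from volume and retention.

    ingest_rate: $ per GB ingested; storage_rate: $ per GB-month retained.
    Simplification: steady-state stored volume is capped at one month's intake.
    """
    ingested = gb_per_day * 30
    retained = gb_per_day * min(retention_days, 30)
    return ingested * ingest_rate + retained * storage_rate

# Hypothetical comparison: incumbent SaaS rates vs. SigNoz Cloud rates.
incumbent = monthly_cost(50, 15, ingest_rate=0.50, storage_rate=0.10)
cheaper   = monthly_cost(50, 15, ingest_rate=0.25, storage_rate=0.05)
savings = 1 - cheaper / incumbent
print(f"projected savings: {savings:.0%}")  # → projected savings: 50%
```

In the actual assessment, the same structure was filled in with measured ingestion volume, retention policy, and infrastructure overhead, which is what produced the ~46% / 90% savings estimates above.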
Operational Efficiency
Centralized log availability, combined with unified logs, metrics, and traces, enabled faster root-cause analysis, cutting time-to-resolve by over 50% and reducing reliance on reactive, customer-reported incidents.
Developer Productivity
Centralized querying and trace-to-log correlation significantly improved troubleshooting efficiency and reduced time spent context-switching across tools.
System Reliability
Improved visibility into exceptions and performance trends helped lower the risk of blind spots and data loss in distributed systems.
Key Learnings
- Observability is more than monitoring: metrics alone are insufficient; combining logs, traces, and metrics is critical for effective root-cause analysis in distributed systems like Cisco NSO.
- Data modeling directly impacts performance and cost: log structure, attribute extraction, and indexing strategy significantly affect ingestion latency, query speed, memory usage, and overall observability spend.
- Prototype validation reduces adoption risk: hands-on prototypes uncover integration complexity, performance bottlenecks, and operational trade-offs that are not evident from documentation or vendor comparisons alone.
- Cost grows non-linearly with scale: log volume, retention policies, and cardinality are the primary cost drivers, making open-source and self-hosted options attractive for high-throughput systems.
- Unified observability improves developer efficiency: trace-to-log correlation and centralized querying reduce context switching and enable earlier issue detection before customer impact.
- Product decisions require technical depth: effective platform evaluation requires balancing usability, scalability, performance, and cost, not just feature completeness.