AI-Powered Log & Observability Platform
Proposed and evaluated SigNoz as a cost-effective, AI-powered observability platform to unify logs, metrics, and traces in a single pane.
Focused on improving developer troubleshooting efficiency through fast querying, detailed insight into logs and device configurations, seamless trace-to-log correlation, anomaly detection, and pattern recognition.


The Problem: Isolated Logs & Limited Observability
Isolated Logs:
- NSO logs (device trace, Java, Python, and northbound logs) are stored only on individual VMs; if a VM crashes, its logs are lost, making recovery and analysis difficult.
- No centralized platform to collect and store logs from across VMs.
Limited Observability:
- The current observability setup restricts exports of logs and device configurations, preventing quick queries against logs, which slows down root-cause analysis and troubleshooting.
- No metrics derived from detailed NSO logs, preventing performance improvements (e.g., exception monitoring, device performance).

My Role
- Developed a deep understanding of Cisco NSO workflows and logging by exploring NSO flows, YANG models, NETCONF/RESTCONF APIs, and log structures to identify observability gaps.
- Defined current and long-term observability requirements, focusing on unified visibility across logs, metrics, and traces with AI-assisted insights.
- Researched and compared multiple observability platforms, evaluating usability, integration effort, performance, and cost trade-offs.
- Conducted a hands-on prototype setup of SigNoz: integrated NSO logs, extracted attributes for feature engineering, and configured SQL-based metrics and exception alerts to evaluate troubleshooting workflows and operational feasibility.
- Evaluated ingestion latency, query performance, memory usage, and estimated cost impact.
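The attribute-extraction step described above can be sketched as a small parser. This is a minimal illustration, assuming a hypothetical NSO log line layout (timestamp, level, device, message); real NSO device-trace, Java, and Python logs use different formats.

```python
import re
from collections import Counter

# Hypothetical log line layout for illustration only; actual NSO log
# formats differ and would need their own patterns.
LOG_PATTERN = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>INFO|WARN|ERROR)\s+"
    r"device=(?P<device>\S+)\s+"
    r"(?P<message>.*)$"
)

def extract_attributes(line: str):
    """Parse one log line into attributes suitable for indexing and metrics."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

def exception_counts(lines):
    """Count ERROR-level lines per device -- the basis for an exception alert."""
    counts = Counter()
    for line in lines:
        attrs = extract_attributes(line)
        if attrs and attrs["level"] == "ERROR":
            counts[attrs["device"]] += 1
    return counts

sample = [
    "2024-05-01 10:00:00 ERROR device=ios-0 connect timeout",
    "2024-05-01 10:00:01 INFO device=ios-1 sync-from ok",
    "2024-05-01 10:00:02 ERROR device=ios-0 commit failed",
]
print(exception_counts(sample))  # Counter({'ios-0': 2})
```

In the prototype itself, attributes like these were extracted at ingestion time so SigNoz could index them and drive SQL-based metrics and alerts, rather than being parsed ad hoc at query time.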
Impact & Metrics
Cost Impact
Projected observability cost savings: ~46% with SigNoz Cloud; up to 90% with a self-hosted open-source deployment.
ROI Assessment
Estimated high ROI (up to 9× vs. other observability platforms) based on ingestion volume, retention, and infrastructure overhead modeling.
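The cost and ROI modeling above can be sketched as a back-of-the-envelope calculation driven by ingestion volume and retention. The per-GB rates below are illustrative placeholders, not actual vendor pricing, and the numbers do not reproduce the figures reported in this project.

```python
def monthly_cost(gb_per_day: float, retention_days: int,
                 ingest_rate: float, storage_rate: float) -> float:
    """Estimate monthly observability spend from volume and retention.

    ingest_rate: $ per GB ingested; storage_rate: $ per GB-month retained.
    Simplification: steady-state stored volume is capped at one month's intake.
    """
    ingested = gb_per_day * 30
    retained = gb_per_day * min(retention_days, 30)
    return ingested * ingest_rate + retained * storage_rate

# Hypothetical comparison: incumbent SaaS rates vs. SigNoz Cloud rates.
incumbent = monthly_cost(50, 15, ingest_rate=0.50, storage_rate=0.10)
cheaper   = monthly_cost(50, 15, ingest_rate=0.25, storage_rate=0.05)
savings = 1 - cheaper / incumbent
print(f"projected savings: {savings:.0%}")  # → projected savings: 50%
```

In the actual assessment, the same structure was filled in with measured ingestion volume, retention policy, and infrastructure overhead, which is what produced the ~46% / 90% savings estimates above.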
Operational Efficiency
Centralized log availability, combined with unified logs, metrics, and traces, enabled faster root-cause analysis, cutting time-to-resolve by over 50% and reducing reliance on reactive, customer-reported incidents.
Developer Productivity
Centralized querying and trace-to-log correlation significantly improved troubleshooting efficiency and reduced time spent context-switching across tools.
System Reliability
Improved visibility into exceptions and performance trends helped lower the risk of blind spots and data loss in distributed systems.
Key Learnings
- Observability is more than monitoring: metrics alone are insufficient; combining logs, traces, and metrics is critical for effective root-cause analysis in distributed systems like Cisco NSO.
- Data modeling directly impacts performance and cost: log structure, attribute extraction, and indexing strategy significantly affect ingestion latency, query speed, memory usage, and overall observability spend.
- Prototype validation reduces adoption risk: hands-on prototypes uncover integration complexity, performance bottlenecks, and operational trade-offs that are not evident from documentation or vendor comparisons alone.
- Cost grows non-linearly with scale: log volume, retention policies, and cardinality are the primary cost drivers, making open-source and self-hosted options attractive for high-throughput systems.
- Unified observability improves developer efficiency: trace-to-log correlation and centralized querying reduce context switching and enable earlier issue detection before customer impact.
- Product decisions require technical depth: effective platform evaluation requires balancing usability, scalability, performance, and cost, not just feature completeness.