Network Observability: Telemetry, OpenTelemetry, gNMI, Streaming Analytics, and AIOps
Network observability goes beyond traditional monitoring by explaining why something is happening, not just what is happening. Telemetry collects data from network devices in real time; OpenTelemetry is an open standard for traces, metrics, and logs; gNMI provides model-driven streaming telemetry from network devices; streaming analytics processes that data in real time; and AIOps uses AI/ML to automate operations.
Traditional monitoring relies on SNMP polling, typically every 5 minutes. It cannot see events that happen between polls, and it does not scale to modern networks with thousands of devices. Streaming telemetry has devices push data in real time (sub-second), so changes are visible as they happen. AIOps analyzes data volumes that humans cannot process in order to detect anomalies, predict failures, and automate remediation.
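
The gap between polls can be illustrated with a small simulation. This is only a sketch under invented assumptions: the `link_state` timeline (a 40-second link flap), the 10-minute window, and the function names are all hypothetical, and no real SNMP or gNMI traffic is involved.

```python
# Hypothetical timeline: the link is down from t=130 s to t=170 s, up otherwise.
def link_state(t: int) -> str:
    return "down" if 130 <= t < 170 else "up"


def poll(interval_s: int, duration_s: int) -> list[tuple[int, str]]:
    """Sample the state every interval_s seconds, the way a poller would."""
    return [(t, link_state(t)) for t in range(0, duration_s + 1, interval_s)]


def stream_on_change(duration_s: int) -> list[tuple[int, str]]:
    """Emit an update only when the state changes, like an event-driven stream."""
    updates = []
    previous = None
    for t in range(duration_s + 1):
        state = link_state(t)
        if state != previous:
            updates.append((t, state))
            previous = state
    return updates


polled = poll(interval_s=300, duration_s=600)
streamed = stream_on_change(duration_s=600)
print("polled:  ", polled)
print("streamed:", streamed)
```

With a 5-minute interval, every poll lands while the link is up, so the flap never appears in the polled data, while the change-driven stream records both transitions.
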
Monitoring vs Observability
| Feature | Monitoring | Observability |
| --- | --- | --- |
| Approach | Poll devices periodically (reactive) | Stream data continuously (proactive) |
| Question | “Is it up or down?” (known unknowns) | “Why is it behaving this way?” (unknown unknowns) |
| Data | Pre-defined metrics (CPU, memory, interface) | Rich, correlated telemetry (flows, traces, logs, metrics) |
| Frequency | 5-minute polling intervals (SNMP) | Sub-second streaming (gNMI, NETCONF) |
| Scale | Limited by polling overhead | Push-based → scales better with many devices |
| Analysis | Threshold-based alerts (static) | ML-based anomaly detection (dynamic baselines) |
Telemetry Types
| Type | What | Use |
| --- | --- | --- |
| Metrics | Numeric measurements over time (CPU 85%, bandwidth 1.2 Gbps) | Performance monitoring, capacity planning, alerting |
| Logs | Timestamped event records (syslog, structured logs) | Troubleshooting, audit trail, security events |
| Traces | End-to-end request path across components | Latency analysis, dependency mapping, root cause |
| Flows | Network traffic metadata (NetFlow, sFlow, IPFIX) | Traffic analysis, security (anomaly detection), capacity |
| Topology | Network graph (nodes, links, relationships) | Impact analysis, path visualization, change detection |
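
As an illustration of how flow metadata feeds traffic analysis, here is a minimal "top talkers" sketch. The `FlowRecord` fields and the sample records are hypothetical; real NetFlow, sFlow, or IPFIX records carry many more fields and would arrive through a flow collector rather than a hard-coded list.

```python
from collections import Counter
from typing import NamedTuple


class FlowRecord(NamedTuple):
    """Simplified, hypothetical flow record (a real one has many more fields)."""
    src: str
    dst: str
    dst_port: int
    byte_count: int


def top_talkers(flows: list[FlowRecord], n: int = 2) -> list[tuple[str, int]]:
    """Sum bytes per source address and return the n largest senders."""
    totals: Counter = Counter()
    for flow in flows:
        totals[flow.src] += flow.byte_count
    return totals.most_common(n)


flows = [
    FlowRecord("10.0.0.1", "10.0.1.5", 443, 500),
    FlowRecord("10.0.0.2", "10.0.1.5", 443, 700),
    FlowRecord("10.0.0.1", "10.0.1.9", 53, 1500),
    FlowRecord("10.0.0.3", "10.0.1.5", 22, 300),
]
print(top_talkers(flows))
```

The same aggregation pattern extends to grouping by destination port or by source/destination pair, which is how flow data typically supports capacity and security questions.
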
gNMI (gRPC Network Management Interface)
| Feature | Details |
| --- | --- |
| What it is | gRPC-based protocol for streaming telemetry from network devices |
| Model-Driven | Uses YANG data models → structured, consistent data across vendors |
| Operations | Capabilities (discover supported models), Get (read), Set (write), Subscribe (stream) |
| Subscribe Modes | SAMPLE (periodic), ON_CHANGE (event-driven), TARGET_DEFINED (device decides); these apply to STREAM subscriptions, alongside the ONCE and POLL subscription types |
| Encoding | Protobuf (efficient binary) or JSON |
| vs SNMP | gNMI: push-based, structured (YANG), efficient (gRPC/protobuf), TLS encrypted |
| vs NETCONF | gNMI: lighter weight, better for telemetry streaming; NETCONF: better for config management |
| Support | Cisco IOS-XR/XE, Arista EOS, Juniper Junos, Nokia SR OS |
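
To make the model-driven idea concrete, the sketch below splits an XPath-like gNMI path string into path elements (a name plus optional keys) and wraps it in a plain dictionary describing a subscription. It uses no gNMI client library: `parse_gnmi_path` and `build_subscription` are hypothetical helpers, and the dictionary only mirrors the shape of a subscription (path, mode, sample interval in nanoseconds) rather than being a real protobuf message.

```python
import re


def parse_gnmi_path(path: str) -> list[dict]:
    """Split a path such as /interfaces/interface[name=Ethernet1/1]/state
    into elements. Slashes inside [key=value] brackets are not separators."""
    elems, current, depth = [], "", 0
    for ch in path.strip("/"):
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "/" and depth == 0:
            elems.append(current)
            current = ""
        else:
            current += ch
    if current:
        elems.append(current)

    parsed = []
    for elem in elems:
        name = elem.split("[", 1)[0]
        keys = dict(re.findall(r"\[([^=\]]+)=([^\]]*)\]", elem))
        parsed.append({"name": name, "key": keys})
    return parsed


def build_subscription(path: str, mode: str = "SAMPLE",
                       sample_interval_s: int = 10) -> dict:
    """Describe one subscription as a plain dict (illustrative only)."""
    if mode not in {"SAMPLE", "ON_CHANGE", "TARGET_DEFINED"}:
        raise ValueError(f"unknown subscription mode: {mode}")
    sub = {"path": parse_gnmi_path(path), "mode": mode}
    if mode == "SAMPLE":
        # gNMI expresses the sample interval in nanoseconds.
        sub["sample_interval"] = sample_interval_s * 1_000_000_000
    return sub


sub = build_subscription("/interfaces/interface[name=Ethernet1/1]/state/counters")
print(sub)
```

A real client would translate a structure like this into a SubscribeRequest and open a gRPC stream to the device; that part is deliberately left out here.
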
OpenTelemetry (OTel)
| Feature | Details |
| --- | --- |
| What it is | CNCF open standard for collecting, processing, and exporting telemetry data |
| Signals | Traces + Metrics + Logs (a unified framework for the 3 pillars) |
| Collector | OTel Collector: receive, process, export telemetry (vendor-agnostic pipeline) |
| SDKs | Libraries for many languages (Python, Go, Java, .NET, JS) → instrument applications |
| OTLP | OpenTelemetry Protocol: the standard wire format for telemetry data |
| Backends | Export to Prometheus, Grafana, Jaeger, Datadog, Splunk, Elastic (vendor-agnostic) |
| Network | OTel is expanding into the network domain (network-specific receivers in the Collector) |
Streaming Analytics Pipeline
| Stage | Function | Tools |
| --- | --- | --- |
| 1. Collect | Ingest telemetry from devices | gNMI, SNMP, syslog, NetFlow, API polling |
| 2. Transport | Message queue/stream processing | Kafka, NATS, gRPC streams |
| 3. Process | Transform, enrich, correlate | Apache Flink, ksqlDB, Telegraf, OTel Collector |
| 4. Store | Time-series database, data lake | Prometheus, InfluxDB, TimescaleDB, Elasticsearch |
| 5. Visualize | Dashboards, alerts, reports | Grafana, Kibana, custom dashboards |
| 6. Act | Automated response, remediation | Ansible, scripts, API calls (closed-loop automation) |
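
The stages above can be sketched in-process with chained Python generators. This is a toy under stated assumptions: the readings, the `INVENTORY` lookup, and the 90% CPU limit are invented, generator chaining stands in for the transport layer (Kafka or similar), a list stands in for the time-series database, and the visualize stage is omitted.

```python
from typing import Iterable, Iterator

# Hypothetical device-to-site inventory used for enrichment.
INVENTORY = {"sw1": "bangkok-dc", "sw2": "chiangmai-pop"}


def collect() -> Iterator[dict]:
    """Stage 1: yield raw readings (hard-coded here instead of gNMI/SNMP)."""
    readings = [("sw1", "cpu", 42.0), ("sw2", "cpu", 97.5), ("sw1", "cpu", 45.0)]
    for device, metric, value in readings:
        yield {"device": device, "metric": metric, "value": value}


def process(events: Iterable[dict]) -> Iterator[dict]:
    """Stage 3: enrich each event with the site it belongs to."""
    for event in events:
        yield {**event, "site": INVENTORY.get(event["device"], "unknown")}


def store(events: Iterable[dict], db: list) -> Iterator[dict]:
    """Stage 4: append to an in-memory list standing in for a TSDB."""
    for event in events:
        db.append(event)
        yield event


def act(events: Iterable[dict], cpu_limit: float = 90.0) -> list[str]:
    """Stage 6: decide on a response for events that breach the limit."""
    actions = []
    for event in events:
        if event["metric"] == "cpu" and event["value"] > cpu_limit:
            actions.append(
                f"open ticket: high CPU on {event['device']} ({event['site']})"
            )
    return actions


db: list = []
actions = act(store(process(collect()), db))
print(actions)
```

Because each stage only consumes an iterable and yields events, any stage can be swapped for a real component (a Kafka consumer, a database writer, an Ansible call) without changing the others.
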
AIOps for Networking
| Capability | How | Benefit |
| --- | --- | --- |
| Anomaly Detection | ML learns normal baseline → detect deviations automatically | No static thresholds → catches unknown issues |
| Root Cause Analysis | Correlate events across devices/layers → identify root cause | Reduce MTTR from hours to minutes |
| Predictive | Forecast failures before they happen (disk, link, capacity) | Prevent outages → proactive maintenance |
| Event Correlation | Group related alerts → reduce alert noise (1000 alerts → 5 incidents) | Can reduce alert fatigue by 80-90% |
| Capacity Planning | Predict when capacity will be exhausted based on growth trends | Budget and procure before hitting limits |
| Auto-Remediation | Detect issue → run playbook → fix automatically (closed loop) | Self-healing network (restart service, reroute traffic) |
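
The dynamic-baseline idea behind anomaly detection can be sketched with a rolling z-score. This is a deliberately simple statistical stand-in for the ML models that AIOps platforms use, not a description of any vendor's method; the window size, the threshold of 3 standard deviations, and the latency samples are all assumptions.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(samples: list[float], window: int = 10,
                     z_threshold: float = 3.0) -> list[tuple[int, float]]:
    """Flag samples that deviate from the rolling baseline of recent values."""
    history: deque = deque(maxlen=window)
    anomalies = []
    for index, value in enumerate(samples):
        if len(history) == window:
            baseline, spread = mean(history), stdev(history)
            if spread > 0 and abs(value - baseline) / spread > z_threshold:
                anomalies.append((index, value))
                continue  # keep the outlier out of the baseline
        history.append(value)
    return anomalies


# Hypothetical latency samples in milliseconds with one spike.
latency_ms = [20, 21, 19, 20, 22, 21, 20, 19, 21, 20, 20, 21, 55, 20, 21]
print(detect_anomalies(latency_ms))
```

A static alert set at, say, 100 ms would never fire on the 55 ms spike, whereas the learned baseline of roughly 20 ms makes it stand out immediately.
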
AIOps Platforms
| Platform | Focus |
| --- | --- |
| Cisco ThousandEyes | Internet/cloud path visibility, digital experience monitoring (DEM) |
| Juniper Mist AI | AI-driven wireless + wired + WAN (Marvis virtual assistant) |
| Datadog | Full-stack observability (infra, APM, network, logs) |
| Splunk ITSI | IT service intelligence, event correlation, ML analytics |
| Elastic Observability | Open-source based (ELK stack), logs + metrics + traces |
| Kentik | Network-specific observability (flow, SNMP, streaming, BGP) |
Closing thoughts: Observability = See, Understand, Act
- Network Observability vs Monitoring: observability asks “why”, not just “what”; it is proactive, uses rich data, and is ML-based.
- Telemetry: metrics + logs + traces + flows + topology (5 pillars).
- gNMI: streaming telemetry (push-based, YANG models, gRPC, sub-second) that replaces SNMP polling.
- OpenTelemetry: CNCF standard (traces + metrics + logs), OTel Collector, vendor-agnostic.
- Pipeline: collect → transport (Kafka) → process (Flink) → store (Prometheus) → visualize (Grafana) → act.
- AIOps: anomaly detection, root cause analysis, prediction, event correlation, auto-remediation.
- Key: move from reactive polling to proactive streaming → understand network behavior, not just status.