Network Observability: Telemetry, OpenTelemetry, gNMI, Streaming Data, AIOps, and Observability Pipeline
Network observability goes beyond traditional monitoring by providing deeper insight into network behavior. Telemetry collects data from network devices in near real time; OpenTelemetry is an open standard for observability data; gNMI is a gRPC-based management interface; streaming data is pushed by devices rather than polled; AIOps applies AI to analyze network data; and an observability pipeline processes and forwards that data.
Network observability differs from monitoring in one key way: monitoring tells you “what broke,” while observability helps explain “why it broke.” Polling SNMP every 5 minutes yields only periodic snapshots, whereas streaming telemetry every second exposes trends, anomalies, and micro-bursts that SNMP polling misses. Modern networks (cloud, SD-WAN, 5G) are too complex to manage with SNMP and syslog alone; they call for observability, which combines telemetry, analytics, and automation.
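
To make the snapshot-versus-stream point concrete, here is a minimal sketch using made-up numbers: a hypothetical 1 Gbps link carries a steady 200 Mbps with a 3-second burst at line rate. The link speed, traffic values, and the 90% burst threshold are all assumptions chosen for illustration.

```python
from statistics import mean

LINK_CAPACITY_MBPS = 1000  # assumed 1 Gbps link (illustrative)

# 300 one-second samples (5 minutes): steady 200 Mbps plus a 3-second burst
per_second_mbps = [200.0] * 300
per_second_mbps[120:123] = [1000.0, 1000.0, 1000.0]

# What a 5-minute poll of byte counters effectively reports: one average
five_min_average = mean(per_second_mbps)

# What 1-second streaming telemetry lets you see: the peak and when it happened
peak = max(per_second_mbps)
burst_seconds = [
    t for t, mbps in enumerate(per_second_mbps)
    if mbps >= 0.9 * LINK_CAPACITY_MBPS
]

print(f"5-minute average: {five_min_average:.1f} Mbps")
print(f"1-second peak:    {peak:.1f} Mbps at t={burst_seconds}")
```

The 5-minute average barely moves above the steady-state rate, while the per-second series shows the link was saturated for three seconds.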
Monitoring vs Observability
| Feature | Traditional Monitoring | Network Observability |
| --- | --- | --- |
| Data Collection | SNMP polling (pull, typically every 5 min) | Streaming telemetry (push, every 1-10 sec) |
| Data Model | MIB/OID (often vendor-specific) | YANG models (structured; vendor-neutral when using OpenConfig) |
| Protocol | SNMP v2c/v3, syslog | gNMI, NETCONF, gRPC, OpenTelemetry |
| Analysis | Threshold-based alerts (static) | ML/AI-based anomaly detection (dynamic) |
| Scope | “Is it up or down?” (known failure modes) | “Why is it slow?” (unknown unknowns) |
| Action | Alert → human investigates | Alert → auto-correlate → suggest or auto-remediate |
Telemetry Types
| Type | Data | Protocol |
| --- | --- | --- |
| Interface Counters | Bytes in/out, errors, discards, utilization per interface | gNMI, SNMP, NETCONF |
| Flow Data | Source/destination IP, ports, protocol, bytes per flow | NetFlow v9, IPFIX, sFlow |
| Routing State | BGP neighbors, prefixes, route changes, convergence events | BMP (BGP Monitoring Protocol), gNMI |
| Device Health | CPU, memory, temperature, fan speed, power consumption | gNMI, SNMP, Redfish (for hardware) |
| Events/Logs | Syslog messages, config changes, security events | Syslog, gRPC events, streaming notifications |
| Packet Capture | Full packet data for deep analysis | SPAN/mirror, ERSPAN, TAP, packet broker |
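
Interface counters are cumulative, so utilization has to be derived from the difference between two samples. The sketch below shows one common way to do that, including a single counter wrap; the function name `utilization_percent` and its parameters are hypothetical and not tied to any particular collector.

```python
def utilization_percent(prev_octets, curr_octets, interval_s, speed_bps,
                        counter_bits=64):
    """Estimate link utilization (%) from two samples of an octet counter.

    Assumes the counter wrapped at most once between the two samples.
    """
    modulus = 2 ** counter_bits
    # Modular subtraction gives the right delta even if the counter wrapped
    delta_octets = (curr_octets - prev_octets) % modulus
    bits_per_second = delta_octets * 8 / interval_s
    return 100.0 * bits_per_second / speed_bps


# 125,000,000 octets in 10 s on a 1 Gbps link
normal = utilization_percent(0, 125_000_000, 10, 1_000_000_000)

# A 32-bit counter that wrapped between samples (1,250 octets in 1 s, 1 Mbps link)
wrapped = utilization_percent(2**32 - 1000, 250, 1, 1_000_000, counter_bits=32)

print(normal, wrapped)
```

With shorter sampling intervals, as in streaming telemetry, the same calculation yields a much finer-grained utilization series.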
gNMI (gRPC Network Management Interface)
| Feature | Detail |
| --- | --- |
| What | gRPC-based protocol for network device management and telemetry; a modern alternative to SNMP and NETCONF |
| Operations | Capabilities (discover supported models and encodings), Get (retrieve data), Set (configure), Subscribe (streaming telemetry) |
| Subscribe Modes | ONCE, POLL, or STREAM; within STREAM, each subscription is SAMPLE (periodic), ON_CHANGE (event-driven), or TARGET_DEFINED (device decides) |
| Data Model | YANG models: OpenConfig (vendor-neutral) or native vendor models |
| Encoding | Protobuf (efficient binary) or JSON |
| vs SNMP | Faster, structured data (YANG), TLS security, streaming rather than polling, bidirectional |
| vs NETCONF | Simpler and better suited to streaming telemetry; gRPC with Protobuf is generally lighter than XML over SSH |
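
The difference between SAMPLE and ON_CHANGE subscriptions can be illustrated without any device. The following is a pure-Python simulation, not a gNMI client: the function names and the `oper_status` timeline are invented for this sketch, and a real target would deliver these updates over a gRPC stream.

```python
def sample_updates(timeline, interval):
    """SAMPLE-style: report the current value every `interval` ticks,
    whether or not it changed."""
    return [(t, value) for t, value in enumerate(timeline) if t % interval == 0]


def on_change_updates(timeline):
    """ON_CHANGE-style: report the initial value, then only transitions."""
    updates = []
    previous = object()  # sentinel that never equals a real value
    for t, value in enumerate(timeline):
        if value != previous:
            updates.append((t, value))
            previous = value
    return updates


# Interface oper-status over 12 seconds with a 1-second flap at t=5
oper_status = ["UP"] * 12
oper_status[5] = "DOWN"

sampled = sample_updates(oper_status, interval=10)
changed = on_change_updates(oper_status)

print("SAMPLE every 10 s:", sampled)
print("ON_CHANGE:        ", changed)
```

In this toy timeline the 10-second sample never observes the flap, while the change-driven stream reports both transitions. That is why state leaves such as oper-status are usually better suited to ON_CHANGE, and counters to SAMPLE.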
Observability Pipeline
| Stage | Function | Tools |
| --- | --- | --- |
| Collect | Gather telemetry from all sources (gNMI, NetFlow, syslog, SNMP) | Telegraf, gNMIc, pmacct, Logstash, Fluent Bit |
| Process | Parse, enrich, filter, aggregate, normalize data | Kafka (streaming), Vector, Cribl, OpenTelemetry Collector |
| Store | Time-series DB for metrics, log store for events | InfluxDB, Prometheus, Elasticsearch, ClickHouse, Loki |
| Visualize | Dashboards, topology maps, alerting | Grafana, Kibana, Datadog, ThousandEyes |
| Analyze | Anomaly detection, root cause analysis, capacity planning | AIOps: Moogsoft, BigPanda, Datadog AI, custom ML |
| Act | Auto-remediation, ticket creation, notification | PagerDuty, Ansible, StackStorm, custom automation |
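
The first three stages can be sketched as chained generators. This is a toy, in-memory illustration only: the event format, the `SITE_BY_DEVICE` inventory, and the function names are assumptions for the example, standing in for what a collector, a stream processor, and a time-series database would do in a real deployment.

```python
import json
from collections import defaultdict

# Hypothetical raw telemetry lines, including one malformed record
RAW_EVENTS = [
    '{"device": "edge-1", "ifname": "eth0", "in_octets": 1000, "ts": 1}',
    '{"device": "edge-1", "ifname": "eth0", "in_octets": 4000, "ts": 2}',
    'not-json',
    '{"device": "core-1", "ifname": "eth3", "in_octets": 500, "ts": 2}',
]
SITE_BY_DEVICE = {"edge-1": "bkk", "core-1": "sgp"}  # hypothetical inventory


def collect(lines):
    """Collect: yield raw records as they arrive from a source."""
    for line in lines:
        yield line


def process(lines):
    """Process: parse, drop malformed records, enrich with a site label."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # filter out records that cannot be parsed
        record["site"] = SITE_BY_DEVICE.get(record["device"], "unknown")
        yield record


def store(records):
    """Store: append (timestamp, value) points to a per-series list."""
    series = defaultdict(list)
    for r in records:
        key = (r["site"], r["device"], r["ifname"])
        series[key].append((r["ts"], r["in_octets"]))
    return dict(series)


tsdb = store(process(collect(RAW_EVENTS)))
for key, points in tsdb.items():
    print(key, points)
```

Keeping each stage as a separate function mirrors the table: any one stage can be swapped out (for example, replacing the in-memory store with a database writer) without touching the others.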
AIOps for Networking
| Capability | How | Benefit |
| --- | --- | --- |
| Anomaly Detection | ML learns a normal baseline → alerts on deviations (not static thresholds) | Detect issues before they impact users; proactive rather than reactive |
| Event Correlation | Correlate thousands of alerts into a few incidents → find root cause automatically | Reduce alert fatigue: 1,000 alerts → 5 incidents → 1 root cause |
| Predictive | Forecast capacity, predict failures based on trends | Plan upgrades before outages, e.g., “link will saturate in 14 days” |
| NLP Log Analysis | Parse unstructured logs with NLP → extract actionable insights | Find patterns in syslog messages that humans miss |
| Auto-Remediation | Detect issue → verify → apply fix automatically (bounce port, clear BGP session, restart service) | Resolve common issues in seconds without human intervention |
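
Two of the capabilities above, baseline-driven anomaly detection and capacity prediction, can be approximated with basic statistics. The sketch below swaps in simple stand-ins for what a commercial AIOps product would do: a trailing-window z-score instead of a learned ML model, and a least-squares line instead of a real forecasting model. The function names, window size, threshold, and sample data are all illustrative assumptions.

```python
from statistics import mean, stdev


def find_anomalies(values, window=10, threshold=3.0):
    """Return indices whose value deviates from the trailing-window
    baseline by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(values[i] - mu) > threshold * sigma:
            anomalies.append(i)
    return anomalies


def days_until_saturation(daily_util, limit=100.0):
    """Fit a least-squares line to daily utilization (%) and return the
    number of days after the last sample until it reaches `limit`.
    Returns None if utilization is not trending upward."""
    xs = list(range(len(daily_util)))
    x_bar, y_bar = mean(xs), mean(daily_util)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, daily_util))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    if slope <= 0:
        return None
    intercept = y_bar - slope * x_bar
    return (limit - intercept) / slope - xs[-1]


# Latency in ms with one spike at index 11
latency_ms = [10, 11, 10, 9, 10, 11, 10, 9, 10, 11, 10, 45, 10]
spikes = find_anomalies(latency_ms)

# Link utilization growing 2 points per day from 60%
utilization = [60.0 + 2.0 * day for day in range(7)]
days_left = days_until_saturation(utilization)

print("anomalous samples:", spikes)
print("days until saturation:", days_left)
```

A static threshold would need to be tuned per link or per metric; a baseline computed from recent history adapts on its own, which is the property the table attributes to ML-based detection.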
Closing Thoughts: Observability = See, Understand, Act on Network State
Network Observability vs Monitoring: monitoring answers “what’s broken”; observability answers “why it’s broken,” including unknown unknowns.
Telemetry: interface counters, flow data, routing state, device health, and events, delivered by streaming (push) rather than polling (pull).
gNMI: gRPC-based, built on YANG models, with Subscribe for streaming; it is taking over from SNMP for modern telemetry.
Pipeline: collect (Telegraf/gNMIc) → process (Kafka/OTel) → store (InfluxDB/Prometheus) → visualize (Grafana) → act.
AIOps: anomaly detection (ML baseline), event correlation (1,000 alerts → 1 root cause), prediction, and auto-remediation.
Key: SNMP polling every 5 minutes leaves blind spots, while streaming telemetry every second gives far finer visibility. Modern networks demand observability.
For further reading on network monitoring (SNMP, NetFlow, Prometheus, Grafana) and network automation (Python, Netmiko, NAPALM, Ansible), see siamlancard.com, icafeforex.com, and siam2r.com.