Network Monitoring: SNMP, NetFlow, sFlow, Syslog, Prometheus, Grafana, and Observability
Network monitoring continuously tracks the status and performance of the network. SNMP pulls device metrics, NetFlow analyzes traffic flows, sFlow uses sampling for high-speed networks, Syslog collects logs from every device, Prometheus stores time-series metrics, Grafana builds dashboards, and observability ties all of these together.
Good network monitoring has to detect problems before users notice them: interface utilization at 80% or more → alert before the link saturates; a rising error count → fix it before an outage; a BGP session flap → investigate before routes are lost. "You can't fix what you can't see": about 60% of MTTR (Mean Time To Resolve) is time spent finding the problem, not fixing it, and good monitoring can cut MTTR by 50-70%.
Monitoring Methods
| Method | Data Type | How | Best For |
| --- | --- | --- | --- |
| SNMP | Device metrics (CPU, memory, interface stats) | Poll devices periodically (GET) or receive traps (TRAP) | Infrastructure health, capacity planning |
| NetFlow/IPFIX | Traffic flows (src/dst IP, port, protocol, bytes) | Router/switch exports flow records to collector | Traffic analysis, security, bandwidth billing |
| sFlow | Sampled packets + counters | Sample 1-in-N packets + export counters | High-speed networks, real-time visibility |
| Syslog | Log messages (events, errors, warnings) | Devices send logs to centralized syslog server | Event correlation, troubleshooting, compliance |
| Streaming Telemetry | Real-time metrics (model-driven) | Device pushes data via gRPC/gNMI (YANG models) | Modern monitoring, high-frequency data, automation |
| Synthetic Monitoring | Active probes (ICMP, HTTP, DNS) | Generate test traffic → measure response | SLA verification, user experience simulation |
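
Syslog messages carry their facility and severity in a numeric `<PRI>` prefix (PRI = facility × 8 + severity), which is what a central server uses to filter and correlate events. The following is a minimal sketch of decoding that prefix; the function name `decode_pri` and the sample message are illustrative assumptions, not something taken from a specific collector.

```python
import re

# Severity names as listed in RFC 5424 (0 is the most severe).
SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]

_PRI_RE = re.compile(r"^<(\d{1,3})>")


def decode_pri(message: str) -> tuple[int, str]:
    """Return (facility, severity_name) from the <PRI> prefix of a syslog line."""
    match = _PRI_RE.match(message)
    if match is None:
        raise ValueError("message has no <PRI> prefix")
    pri = int(match.group(1))
    if pri > 191:  # facility 23 * 8 + severity 7 is the largest valid value
        raise ValueError(f"PRI {pri} out of range")
    facility, severity = divmod(pri, 8)
    return facility, SEVERITIES[severity]


# Hypothetical message from a switch: 187 = facility 23 (local7) * 8 + severity 3 (err).
sample = "<187>Oct 11 22:14:15 sw1 %LINK-3-UPDOWN: Interface Gi0/1, changed state to down"
facility, severity = decode_pri(sample)
print(facility, severity)
```

A collector would typically route on the decoded severity, for example forwarding `err` and above to the alerting pipeline while only archiving `info` and `debug`.
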
SNMP
| Feature | Details |
| --- | --- |
| Versions | v1 (basic, insecure), v2c (community string, bulk get), v3 (authentication + encryption; use this) |
| GET | NMS polls device → device returns requested OID value (e.g., ifInOctets for interface bytes in) |
| WALK | Walk through MIB tree → retrieve multiple OIDs sequentially |
| TRAP | Device sends unsolicited alert to NMS → link down, CPU high, fan fail |
| INFORM | Like TRAP but with acknowledgment → reliable delivery (v2c/v3) |
| MIB | Management Information Base: defines OIDs (object identifiers) for device data |
| Polling Interval | Typical: 5 minutes (balance between granularity and device load) |
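
Counters such as ifInOctets only ever increase, so an NMS derives utilization from the difference between two polls. The sketch below shows that arithmetic, including a single wrap of a 32-bit counter. The function names and the sample readings are assumptions made for illustration; a real poller would obtain the values through an SNMP library.

```python
def counter_delta(previous: int, current: int, bits: int = 32) -> int:
    """Difference between two counter readings, tolerating one wrap-around."""
    modulus = 1 << bits
    return (current - previous) % modulus


def utilization_percent(prev_octets: int, curr_octets: int,
                        interval_s: float, if_speed_bps: float,
                        bits: int = 32) -> float:
    """Average inbound utilization over the polling interval, in percent."""
    delta_bits = counter_delta(prev_octets, curr_octets, bits) * 8
    return 100.0 * delta_bits / (interval_s * if_speed_bps)


# Hypothetical readings taken 300 s apart on a 100 Mbit/s interface.
# The second reading is smaller because the 32-bit counter wrapped once.
util = utilization_percent(
    prev_octets=4_000_000_000,
    curr_octets=2_705_032_704,
    interval_s=300,
    if_speed_bps=100_000_000,
)
print(f"{util:.1f}% utilized")
```

At a 5-minute interval a 32-bit counter can wrap more than once on fast links, which this arithmetic cannot detect; on such links the 64-bit ifHCInOctets counter is the safer choice (pass `bits=64`).
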
NetFlow / IPFIX
| Feature | Details |
| --- | --- |
| Flow | Unidirectional packet stream sharing the same src/dst IP, src/dst port, protocol, ToS, and input interface |
| NetFlow v5 | Fixed format, IPv4 only; legacy but widely supported |
| NetFlow v9 | Template-based, flexible fields; supports IPv6 and MPLS labels |
| IPFIX | IETF standard (based on NetFlow v9); vendor-neutral, extensible templates |
| Collector | Receives flow records: ntopng, Elasticsearch, SolarWinds NTA, ManageEngine NetFlow Analyzer |
| Use Cases | Top talkers, application bandwidth, DDoS detection, forensics, capacity planning |
| Impact | Minimal CPU impact on modern routers (hardware-based flow cache) |
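
To make the "top talkers" use case concrete, here is a small sketch of what a collector does after decoding flow records: sum the bytes per source address and rank the results. The `FlowRecord` type, its field names, and the sample flows are hypothetical; real records would come from a NetFlow or IPFIX decoder.

```python
from collections import Counter
from typing import NamedTuple


class FlowRecord(NamedTuple):
    """A simplified flow record (hypothetical subset of the exported fields)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int  # IP protocol number, e.g. 6 = TCP, 17 = UDP
    octets: int


def top_talkers(flows: list[FlowRecord], n: int = 3) -> list[tuple[str, int]]:
    """Return the n source addresses that sent the most bytes."""
    totals: Counter[str] = Counter()
    for flow in flows:
        totals[flow.src_ip] += flow.octets
    return totals.most_common(n)


# Illustrative flows using documentation address ranges.
flows = [
    FlowRecord("10.0.0.5", "192.0.2.10", 51514, 443, 6, 1_200_000),
    FlowRecord("10.0.0.7", "192.0.2.10", 40022, 443, 6, 300_000),
    FlowRecord("10.0.0.5", "198.51.100.4", 51600, 53, 17, 4_000),
    FlowRecord("10.0.0.9", "192.0.2.20", 33000, 22, 6, 900_000),
]
print(top_talkers(flows, n=2))
```

Grouping by `dst_port` instead of `src_ip` gives a rough per-application bandwidth view with the same code.
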
sFlow
| Feature | Details |
| --- | --- |
| How | Sample 1 packet out of every N (e.g., 1-in-1000) + export interface counters periodically |
| Advantage | Scales to 100G+ links (hardware sampling keeps CPU load low even at high traffic rates), real-time, vendor-neutral |
| vs NetFlow | sFlow samples packets (approximate); unsampled NetFlow tracks every flow (exact, but more CPU/memory) |
| Accuracy | Statistically accurate for high-volume flows → less accurate for small flows (may miss them entirely) |
| Collector | sFlow-RT (real-time), InMon, ntopng, Kentik |
| Support | Wide vendor support: Arista, Dell, HP, Juniper, Mellanox; standard on data center switches |
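
The accuracy trade-off of sampling can be illustrated with a little arithmetic: a collector scales the number of samples by N to estimate the true total, and the error shrinks as more samples are seen. The sketch below assumes the rule of thumb commonly quoted in sFlow sampling guidance (relative error at 95% confidence ≤ 196 × sqrt(1/c) percent, where c is the number of samples). That formula and the function names are assumptions added here, not something the table states.

```python
import math


def estimate_total(samples: int, sampling_rate: int) -> int:
    """Scale a sample count back up to an estimated packet total (1-in-N sampling)."""
    return samples * sampling_rate


def relative_error_percent(samples: int) -> float:
    """Approximate 95% confidence bound on the relative error, in percent."""
    if samples <= 0:
        raise ValueError("need at least one sample")
    return 196.0 * math.sqrt(1.0 / samples)


# Hypothetical traffic classes observed with 1-in-1000 sampling.
for name, samples in [("bulk backup", 10_000), ("small DNS flow", 4)]:
    total = estimate_total(samples, sampling_rate=1000)
    error = relative_error_percent(samples)
    print(f"{name}: ~{total} packets, error up to about {error:.0f}%")
```

With 10,000 samples the estimate is good to within a few percent, while a flow seen in only four samples could be off by nearly its full size, which is why small flows are poorly represented.
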
Prometheus + Grafana
| Component | Function |
| --- | --- |
| Prometheus | Time-series database + scraping engine: pull metrics from exporters at regular intervals |
| SNMP Exporter | Convert SNMP data → Prometheus metrics format → scrape network devices |
| Node Exporter | Linux server metrics: CPU, memory, disk, network interfaces |
| Blackbox Exporter | Active probing: ICMP ping, HTTP checks, DNS, TCP → synthetic monitoring |
| PromQL | Query language: rate(), increase(), histogram_quantile() → powerful metric analysis |
| Alertmanager | Receives alerts fired by Prometheus alerting rules → groups, deduplicates, and routes them to email, Slack, PagerDuty, webhook |
| Grafana | Visualization: dashboards, graphs, tables, heatmaps → query Prometheus + other data sources |
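
Dashboards and scripts read from Prometheus through its HTTP API by sending a PromQL expression to `/api/v1/query`. The sketch below builds such a request URL and parses an instant-vector response without touching the network. The host `prometheus.example`, the `ifName` label, and the canned response are assumptions for illustration; the metric name `ifHCInOctets` assumes the SNMP Exporter is scraping IF-MIB.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(body: str) -> dict[str, float]:
    """Map each series' ifName label to its value from an instant-vector response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample is {"metric": {labels...}, "value": [timestamp, "value-as-string"]}.
    return {
        sample["metric"].get("ifName", ""): float(sample["value"][1])
        for sample in data["result"]
    }


# Inbound bits per second, averaged over the last 5 minutes.
promql = "rate(ifHCInOctets[5m]) * 8"
url = build_query_url("http://prometheus.example:9090/", promql)

# Canned response in the documented shape, standing in for a real HTTP call.
canned = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"ifName": "Gi0/1", "instance": "sw1"},
             "value": [1700000000.0, "812345678.5"]},
        ],
    },
})
print(url)
print(parse_instant_vector(canned))
```

In practice the URL would be fetched with an HTTP client and the body passed to `parse_instant_vector`; keeping the two steps separate makes the parsing easy to test offline.
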
Observability Stack
| Pillar | Data | Tools |
| --- | --- | --- |
| Metrics | Numeric measurements over time (CPU 85%, latency 12ms, throughput 500Mbps) | Prometheus, InfluxDB, Datadog, PRTG |
| Logs | Structured/unstructured event records (syslog, application logs) | ELK Stack (Elasticsearch + Logstash + Kibana), Loki, Splunk |
| Traces | Request flow through distributed system (service A → B → C) | Jaeger, Zipkin, OpenTelemetry, Datadog APM |
| Events | Change events: config change, deployment, BGP flap, interface down | Event correlation, CMDB integration |
Monitoring Best Practices
| Practice | Detail |
| --- | --- |
| Baseline | Establish normal baselines → alert on deviations (not just thresholds) |
| Alert Tiers | Warning (investigate) → Critical (act now) → Emergency (all hands); avoid alert fatigue |
| Dashboard Design | Overview → drill-down: site overview → device → interface → flow analysis |
| Retention | High-res (5min) for 7 days → medium (1hr) for 90 days → low (1day) for 2+ years |
| SNMPv3 | Always use SNMPv3 with authentication + encryption; never v1/v2c in production |
| Automation | Auto-remediation: interface down → check, restart → escalate if persists |
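
The baseline and alert-tier practices can be combined: rather than a fixed threshold, compare a new reading with the mean and standard deviation of recent history and map the size of the deviation to a tier. The following is one possible sketch; the tier cut-offs (2, 3, and 5 standard deviations) and the sample data are arbitrary assumptions, not recommended values.

```python
from statistics import mean, stdev

# Hypothetical cut-offs in standard deviations from the baseline mean,
# checked from most to least severe.
TIERS = [(5.0, "emergency"), (3.0, "critical"), (2.0, "warning")]


def classify(value: float, baseline: list[float]) -> str:
    """Classify a reading by how far it deviates from the baseline."""
    mu = mean(baseline)
    sigma = stdev(baseline)  # requires at least two baseline points
    if sigma == 0:
        return "ok" if value == mu else "warning"
    z = abs(value - mu) / sigma
    for cutoff, tier in TIERS:
        if z >= cutoff:
            return tier
    return "ok"


# Illustrative utilization history (percent) for one interface: mean 40, stdev 2.
history = [40, 42, 38, 41, 39, 40, 43, 37]
for reading in (41, 45, 47, 55):
    print(reading, classify(reading, history))
```

Because the comparison uses the absolute deviation, an unusual drop in traffic is flagged as well as a spike, which can reveal upstream failures that a simple high-water threshold would miss.
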
Closing Thoughts: Monitoring = Eyes and Ears of Your Network
SNMP: poll device metrics (v3 with auth/encryption), GET/WALK/TRAP, 5-min interval
NetFlow/IPFIX: traffic flow analysis for top talkers, app bandwidth, security, DDoS detection
sFlow: packet sampling (1-in-N) that scales to 100G+, real-time, data center standard
Syslog: centralized log collection for event correlation, troubleshooting, compliance
Prometheus + Grafana: open-source metrics + visualization, SNMP/Blackbox exporters, Alertmanager
Observability: metrics + logs + traces + events, with ELK for logs, Prometheus for metrics, OpenTelemetry for traces
Key: 60% of MTTR is finding the problem; good monitoring reduces MTTR by 50-70% and prevents outages proactively
Read more about Network Performance (Latency, Jitter, Throughput, QoE) and Network Troubleshooting Methodology (OSI, Wireshark) at siamlancard.com, or from icafeforex.com and siam2r.com.