Prometheus and Node Exporter: Setting Up Open Source Server Monitoring (2026)

April 10, 2026

Introduction: Why IT Ops in 2026 Needs Systematic Monitoring

In 2026, IT infrastructure keeps growing more complex, spanning physical servers, virtual machines, containers, and cloud services. Monitoring these systems is no longer optional; it is a necessity. A server that goes down unnoticed until someone phones in, or a disk that fills up and halts the system in the middle of the night, is a situation no administrator wants to face.

Prometheus is an open source monitoring system originally developed at SoundCloud and now hosted by the Cloud Native Computing Foundation (CNCF), like Kubernetes. What makes Prometheus stand out is its simple yet powerful architecture: a pull-based model in which the server collects metrics from its targets, the PromQL query language, which is highly flexible for querying data, and a broad ecosystem of hundreds of exporters.

This article is written from the perspective of IT operations rather than DevOps or SRE. The focus is on monitoring the traditional servers and infrastructure that sysadmins look after: CPU, memory, disk, network, service status, and hardware health. It does not focus on Kubernetes or microservices, the use cases Prometheus is best known for. For sysadmins who run Windows servers, Linux servers, and network devices in an organization, Prometheus is an excellent tool as well.

Prometheus Architecture: Understand It Before Installing

Pull-based Model

Prometheus uses a pull-based model: the Prometheus server scrapes metrics from each target at a configured interval (the built-in default is 1 minute; the sample configuration that ships with Prometheus sets 15 seconds). This differs from monitoring systems such as Zabbix or Nagios, which use both push and pull. The pull model has several advantages. Prometheus learns that a target is down as soon as a scrape fails. Clients do not need to be configured to send metrics to Prometheus, which reduces client-side configuration. The scrape interval is controlled from one place. Debugging is easy, because you can open the metrics a target exposes directly in a browser.

Core Components of Prometheus

The Prometheus ecosystem consists of the following main components:

Prometheus Server is the heart of the system. It scrapes metrics, stores them in a time series database (TSDB), and evaluates PromQL queries.

Exporters are agents installed on targets that expose metrics in a format Prometheus understands, such as Node Exporter for Linux and Windows Exporter (formerly WMI Exporter) for Windows.

Alertmanager handles alerts. It receives alerts from Prometheus and sends notifications through email, Slack, LINE, PagerDuty, and other channels.

Pushgateway accepts pushed metrics from short-lived jobs such as batch scripts or cron jobs.

Grafana is not part of Prometheus itself, but it is the standard dashboard used alongside it.

Time Series Database (TSDB)

Prometheus stores data as time series: sequences of timestamped samples. Every metric has a name and a set of labels. For example, node_cpu_seconds_total{cpu="0", mode="idle"} is the time CPU core 0 has spent in idle mode. The Prometheus TSDB is designed for write-heavy workloads and can ingest hundreds of thousands of samples per second on commodity hardware. The default retention is 15 days and can be adjusted as needed. Keep in mind that disk usage grows with the number of time series (cardinality), not the number of targets. A label whose values change frequently or are unbounded (high cardinality) will therefore drive disk usage up sharply.
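To make the "name plus labels" idea concrete, here is a minimal Python sketch that splits one line of the Prometheus text exposition format into its parts. The function names parse_sample and series_key are hypothetical helpers for illustration, and the regular expressions cover only the simple cases shown here, not the full format.

```python
import re

# Matches one sample line of the text exposition format, for example:
#   node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one sample line into (metric name, labels dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    name, raw_labels, value = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(value)


def series_key(name, labels):
    """A time series is identified by its name plus its full label set."""
    return (name, tuple(sorted(labels.items())))
```

Every distinct series_key is a separate time series on disk, which is why a label with many possible values multiplies storage cost.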

Installing Prometheus

Installing with Docker (Recommended for Getting Started)

Installing Prometheus with Docker is the easiest and fastest route. It suits both experimentation and production environments that already run containers. You need Docker and Docker Compose. First, create a docker-compose.yml that defines the Prometheus service with volumes for configuration and data. Next, create prometheus.yml for the main configuration. Then run docker-compose up -d (docker compose up -d with the Compose v2 plugin) and the Prometheus server is ready, with the web UI on port 9090. The advantages of Docker are easy upgrades (just change the image tag), easy rollback when something goes wrong, and isolation from the host OS.

Installing on Bare Metal Linux

For production environments that need maximum performance or want to avoid Docker overhead, a bare metal installation is a good choice. The steps are as follows. Create a dedicated user to run Prometheus (do not run it as root). Download the binary from the Prometheus GitHub releases page. Extract the archive and move the binaries to /usr/local/bin/. Create directories for configuration (/etc/prometheus/) and data (/var/lib/prometheus/). Create the prometheus.yml configuration file. Create a systemd service file to manage the service, then enable and start it. It is important to set storage retention to match the available disk space, using the flag --storage.tsdb.retention.time to set how long data is kept, or --storage.tsdb.retention.size to cap the disk space used.

Prometheus Configuration: prometheus.yml

prometheus.yml is the main Prometheus configuration file. It has the following key sections:

global holds defaults that apply to every scrape job: scrape_interval (how often to scrape; the built-in default is 1 minute, and 15 seconds is a common choice), evaluation_interval (how often alert rules are evaluated), and external_labels (labels attached to every time series when it leaves this server, for example in alerts).

scrape_configs lists the scrape jobs that define which targets Prometheus scrapes. Each job has a job_name, either static_configs or a service discovery block, and optionally scrape_interval (overrides the global value), metrics_path (default /metrics), and scheme (http or https).

rule_files lists the files that contain recording rules and alerting rules.

alerting configures the connection to Alertmanager.

A basic configuration that monitors Prometheus itself plus two Node Exporter hosts might contain a scrape_configs entry for a job named prometheus targeting localhost:9090, and a job named node targeting the servers where Node Exporter is installed.
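The basic configuration described above can be sketched as a small Python helper that renders a prometheus.yml from a list of hosts. This is an illustration rather than a reference configuration: the function name render_prometheus_yml, the rules/*.yml path, the 10.0.0.x addresses, and the assumption that Alertmanager listens on localhost:9093 are all placeholders to adapt.

```python
def render_prometheus_yml(node_targets, scrape_interval="15s",
                          rule_files=("rules/*.yml",)):
    """Render a minimal prometheus.yml as text.

    node_targets: list of "host:port" strings where Node Exporter listens.
    """
    targets = ", ".join(f"'{target}'" for target in node_targets)
    lines = [
        "global:",
        f"  scrape_interval: {scrape_interval}",
        f"  evaluation_interval: {scrape_interval}",
        "",
        "rule_files:",
        *[f"  - '{path}'" for path in rule_files],
        "",
        "alerting:",
        "  alertmanagers:",
        "    - static_configs:",
        "        - targets: ['localhost:9093']  # assumed Alertmanager address",
        "",
        "scrape_configs:",
        "  - job_name: 'prometheus'  # Prometheus scraping itself",
        "    static_configs:",
        "      - targets: ['localhost:9090']",
        "  - job_name: 'node'  # Linux servers running Node Exporter",
        "    static_configs:",
        f"      - targets: [{targets}]",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical addresses for two Node Exporter hosts.
    print(render_prometheus_yml(["10.0.0.11:9100", "10.0.0.12:9100"]))
```

Generating the file from a host list keeps the configuration reproducible; the output can be checked with promtool check config before reloading Prometheus.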

Service Discovery

In environments with many servers, or where servers are added and removed often, static_configs becomes inconvenient. Prometheus supports several forms of service discovery:

File-based service discovery reads targets from JSON or YAML files, which can be updated without restarting Prometheus.

DNS service discovery finds targets from DNS records.

Consul service discovery is for environments that use Consul.

EC2 service discovery auto-discovers EC2 instances on AWS.

Kubernetes service discovery auto-discovers pods and services on Kubernetes.

For sysadmins who look after traditional servers, file-based service discovery is the recommended choice, because it is the simplest and can be integrated with an existing CMDB or inventory system.
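As a rough sketch of the CMDB integration idea, the Python below turns an inventory list into the JSON structure that file-based service discovery reads: a list of groups, each with targets and labels. The inventory fields (address, env, role) and the helper names are assumptions made for this example; a real export from an inventory system will look different.

```python
import json
import os


def build_file_sd(inventory, port=9100):
    """Group hosts by (env, role) into file_sd target groups."""
    groups = {}
    for host in inventory:
        key = (host["env"], host["role"])
        groups.setdefault(key, []).append(f'{host["address"]}:{port}')
    return [
        {"targets": sorted(targets), "labels": {"env": env, "role": role}}
        for (env, role), targets in sorted(groups.items())
    ]


def write_file_sd(path, inventory, port=9100):
    """Write the target file via a temporary file and an atomic rename,
    so Prometheus never reads a half-written file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(build_file_sd(inventory, port), handle, indent=2)
    os.replace(tmp_path, path)
```

A scrape job would then point at the generated file through a file_sd_configs block listing its path, and Prometheus picks up changes to that file without a restart.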

Node Exporter: Metrics from Linux Servers

Installing Node Exporter

Node Exporter is the official exporter for Linux servers. It exposes hardware and OS metrics over an HTTP endpoint (default port 9100). Installation is simple: download the binary from the GitHub releases page, extract it and move the binary to /usr/local/bin/, create a systemd service file, then enable and start the service. Once it is running, you can view the metrics at http://server-ip:9100/metrics, where you will see the large list of metrics Node Exporter exposes. Make sure the firewall allows the Prometheus server to reach port 9100 on each Node Exporter host, but do not open the port to everyone, because the metrics may contain sensitive information.

Key Metrics Provided by Node Exporter

Node Exporter provides a large number of metrics. The main ones a sysadmin should know are:

CPU metrics: node_cpu_seconds_total shows the time the CPU has spent in each mode (user, system, idle, iowait, steal) and is the basis for calculating CPU usage. node_load1, node_load5, and node_load15 show the load average.

Memory metrics: node_memory_MemTotal_bytes shows total RAM, node_memory_MemAvailable_bytes shows available RAM (which is not the same as free RAM), and node_memory_SwapTotal_bytes and node_memory_SwapFree_bytes cover swap usage.

Disk metrics: node_filesystem_size_bytes and node_filesystem_avail_bytes cover disk space, and node_disk_read_bytes_total and node_disk_written_bytes_total cover disk I/O.

Network metrics: node_network_receive_bytes_total and node_network_transmit_bytes_total cover network traffic, and node_network_up shows interface status.

Filesystem metrics: node_filesystem_files shows total inodes and node_filesystem_files_free shows free inodes. Inode exhaustion is a common problem that is often overlooked.

Windows Server Monitoring: Windows Exporter

For Windows Server, use Windows Exporter (formerly named WMI Exporter), which plays the same role as Node Exporter but for Windows. To install it, download the MSI installer from the GitHub releases page and install it with msiexec or by double-clicking the MSI file. The installer creates a Windows service named windows_exporter automatically. The default port is 9182.

The main metrics Windows Exporter provides include windows_cpu_time_total for CPU usage, windows_cs_physical_memory_bytes for physical memory, windows_logical_disk_size_bytes and windows_logical_disk_free_bytes for disk space, windows_net_bytes_total for network traffic, windows_service_state for Windows service status, and windows_os_info for OS information such as version and build number. Metric names have changed between Windows Exporter releases, so confirm them against the /metrics output of the version you install.

One caveat is that Windows Exporter can consume significant CPU if every collector is enabled. Enable only the collectors you need, such as cpu, memory, logical_disk, net, os, service, and system.

PromQL Basics for Sysadmins

Data Types in PromQL

PromQL (Prometheus Query Language) is the language used to query data from Prometheus. It has four data types:

Instant vector: a set of time series with a single sample per series at one point in time, for example node_memory_MemAvailable_bytes.

Range vector: a set of time series with multiple samples per series over a time range, for example node_cpu_seconds_total[5m].

Scalar: a plain floating-point number.

String: a text value (rarely used).

Understanding these data types matters because some functions accept only a specific type. For example, rate() requires a range vector.

Most Commonly Used Functions

The functions a sysadmin should know are:

rate() calculates the per-second rate of a counter metric. For example, rate(node_cpu_seconds_total{mode="idle"}[5m]) gives the rate of CPU idle time over the last 5 minutes, which can be used to calculate CPU usage. rate() is well suited to alert rules because it averages over the whole window and is therefore smoother than instantaneous alternatives such as irate().

increase() calculates the total increase of a counter over the given range. For example, increase(node_network_receive_bytes_total[1h]) gives the number of bytes received in the past hour.

histogram_quantile() calculates a percentile from a histogram metric. For example, histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) gives the 95th percentile of request duration.

avg_over_time(), max_over_time(), and min_over_time() calculate the average, maximum, and minimum over the given range.

predict_linear() predicts a future value from the current trend. For example, predict_linear(node_filesystem_avail_bytes[6h], 24*3600) predicts how much disk space will be left in 24 hours, which makes it a powerful function for capacity planning.

PromQL Recipes for Sysadmins

The PromQL recipes most often used for server monitoring are:

CPU usage percentage: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) gives the percentage of CPU in use.

Memory usage percentage: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 gives the percentage of RAM in use.

Disk usage percentage: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 gives the percentage of disk in use.

Network traffic: rate(node_network_receive_bytes_total[5m]) * 8 gives the network receive rate in bits per second.

Disk I/O: rate(node_disk_read_bytes_total[5m]) + rate(node_disk_written_bytes_total[5m]) gives total disk I/O in bytes per second.

These recipes can be used directly in Grafana dashboards or alert rules.

Alerting Rules: Get Notified Before Problems Hit

Writing Alert Rules

Prometheus alert rules are written in YAML files that are separate from prometheus.yml but must be referenced under rule_files. Each alert rule consists of alert (the alert name), expr (the PromQL expression that triggers the alert), for (how long the condition must hold before the alert fires), labels (extra labels such as severity), and annotations (extra information such as summary and description).
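The effect of the for field is easiest to see as a small state machine. The Python sketch below simulates it under simplifying assumptions: one boolean result per rule evaluation and a fixed evaluation interval. The function name alert_states is hypothetical.

```python
def alert_states(condition_by_eval, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_by_eval: list of booleans, True when expr returned a result.
    The alert is "pending" while the condition holds but has not yet held
    for for_s seconds, and "firing" once it has. Any False resets it.
    """
    states = []
    active_since = None
    for index, condition in enumerate(condition_by_eval):
        now = index * eval_interval_s
        if not condition:
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states
```

With a 60-second evaluation interval and for set to 2 minutes, a single failed check never notifies anyone; the condition has to hold across three consecutive evaluations first.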

Alert Rules Every Sysadmin Should Have

The basic alert rules every server should have are:

Instance Down: alerts when a target is down, using the expression up == 0 with for set to 5m to avoid false positives from brief network glitches.

High CPU Usage: alerts when CPU usage stays above 80% for 15 minutes.

Disk Almost Full: alerts when disk usage exceeds 85% (warning) and 95% (critical). This is the most important alert for a sysadmin, because a full disk can take the system down immediately.

Memory Pressure: alerts when available memory falls below 10% of total RAM.

High Swap Usage: alerts when swap usage is high, which indicates insufficient RAM.

High I/O Wait: alerts when CPU iowait is high, which indicates that disk I/O is a bottleneck.

Network Interface Down: alerts when a network interface goes down.

Service Down: alerts when a critical service such as MySQL, Nginx, or Apache stops.

Disk Prediction: uses predict_linear to alert when the disk is projected to fill within 24 hours.

Reboot Detection: alerts when a server reboots (node_boot_time_seconds changes).

Alertmanager: Managing Notifications Professionally

Installation and Configuration

Alertmanager is the component that receives alerts from Prometheus and sends notifications to the various channels. It is installed the same way as Prometheus: download the binary and create a systemd service. Its main configuration lives in alertmanager.yml, which consists of route, which decides which alert goes to which channel, receivers, which define the notification channels, and inhibit_rules, which define which alerts suppress others.

Setting Up Email Notifications

Email is the basic notification channel every organization has. To set it up in Alertmanager, specify the SMTP server, the from address, the to address, and a template for the email content. The advantages of email are that everyone has it and messages can be archived. The disadvantages are that it is slow and alerts can be missed if people do not check their inbox often.

Setting Up Slack Notifications

Slack is a very popular channel for IT teams because notifications arrive immediately and the team can discuss the problem in the same thread. To set it up, create an Incoming Webhook in Slack and put its URL in the Alertmanager configuration, as api_url on the Slack receiver under slack_configs or as slack_api_url in the global section. The message format can be customized with templates to show the information you need, such as alert name, severity, instance, and description.

Setting Up LINE Notifications for Teams in Thailand

For IT teams in Thailand, LINE is the main communication channel that everyone uses. Alertmanager has no native LINE receiver, so alerts reach LINE through its webhook receiver: Alertmanager posts the alert to a small custom webhook service or script, which forwards the message to LINE's HTTP API, for example with cURL. Setups of this kind were traditionally built on the LINE Notify API, but LINE ended the LINE Notify service on March 31, 2025, so new setups should target the LINE Messaging API instead. The benefits of LINE are that everyone on the team gets notifications immediately on their phone, messages can go to a group, and the setup is not difficult.
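The forwarding step of such a custom webhook might look like the Python sketch below. It assumes the LINE Messaging API push endpoint and a channel access token; the helper names, the message layout, and the token and group ID values are placeholders, and the payload fields used are the ones Alertmanager's webhook receiver sends (status, alerts, labels, annotations). The request is only built here, not sent.

```python
import json
import urllib.request

# Assumed endpoint: the push-message call of the LINE Messaging API.
LINE_PUSH_URL = "https://api.line.me/v2/bot/message/push"


def format_alerts(payload):
    """Turn an Alertmanager webhook payload into a short text message."""
    lines = [f"[{payload['status'].upper()}] {len(payload['alerts'])} alert(s)"]
    for alert in payload["alerts"]:
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            f"- {labels.get('alertname', 'unknown')} "
            f"({labels.get('severity', 'n/a')}) "
            f"on {labels.get('instance', 'n/a')}: "
            f"{annotations.get('summary', '')}"
        )
    return "\n".join(lines)


def build_line_request(payload, channel_token, to_id):
    """Build (but do not send) the HTTP request that pushes the message.

    channel_token and to_id are placeholders for a real channel access
    token and a user or group ID.
    """
    body = {"to": to_id,
            "messages": [{"type": "text", "text": format_alerts(payload)}]}
    return urllib.request.Request(
        LINE_PUSH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {channel_token}"},
        method="POST",
    )
```

In a real service, the request would be sent with urllib.request.urlopen inside the handler that receives Alertmanager's POST, with error handling and retries added.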

Grouping, Inhibition, and Silencing

Alertmanager has three important features that keep alerts from flooding the team:

Grouping combines related alerts into a single notification. For example, if 10 servers go down at once, instead of sending 10 notifications it sends 1 notification listing all 10 servers.

Inhibition suppresses lower-priority alerts while a higher-priority alert is firing. For example, if a server is down (critical), there is no need to send a High CPU alert (warning) for the same server.

Silencing mutes alerts temporarily, for example during a maintenance window when alerts are not wanted. Silences can be created in the Alertmanager web UI.
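Grouping and inhibition can be pictured as two small list operations. The Python sketch below illustrates the idea only; it is not Alertmanager's implementation, and the function names and the alert dictionaries are made up for the example.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("alertname",)):
    """Bucket alerts that share the same values for the group_by labels.
    Each bucket would be delivered as a single notification."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


def inhibit(alerts, source_severity="critical", target_severity="warning",
            equal=("instance",)):
    """Drop target-severity alerts when a source-severity alert exists
    with the same values for the `equal` labels."""
    def identity(alert):
        return tuple(alert["labels"].get(label) for label in equal)

    sources = {identity(a) for a in alerts
               if a["labels"].get("severity") == source_severity}
    return [a for a in alerts
            if not (a["labels"].get("severity") == target_severity
                    and identity(a) in sources)]
```

In this sketch, a critical InstanceDown alert on a server hides the warning-level alerts for that same instance, while warnings for other servers still go out.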

Grafana Dashboard: Clear, Easy-to-Use Visualization

Installing Grafana

Grafana is the open source dashboard that is used with Prometheus as standard. It can be installed in several ways: Docker, a package manager (apt, yum), or a direct binary download. After installation, Grafana runs on port 3000 and the default login is admin/admin, which should be changed at first login. The first thing to do after logging in is to add Prometheus as a data source: go to Configuration > Data Sources > Add data source > Prometheus (the menu is named Connections > Data sources in recent Grafana versions) and enter the URL of the Prometheus server, for example http://localhost:9090.

Node Exporter Full Dashboard

Instead of building a dashboard from scratch, you can import ready-made dashboards from Grafana's dashboard library. The most popular dashboard for Node Exporter is Node Exporter Full (dashboard ID 1860), which covers every area:

CPU usage in detail, split by mode (user, system, iowait, steal).

Memory usage, showing used, cached, buffers, and available.

Disk space by mount point.

Disk I/O, showing read/write IOPS and throughput.

Network traffic by interface.

System info, showing uptime, kernel version, and CPU count.

Importing the dashboard is simple: go to Dashboards > Import, enter dashboard ID 1860, and select the Prometheus data source. The dashboard is then ready to use.

Building Custom Dashboards

For organization-specific dashboards, Grafana offers a variety of panel types: Time Series for metrics along a timeline, Stat for the current value as a large number, Gauge for a value against a threshold, Table for tabular data, Heatmap for distributions over time, and Alert List for currently active alerts. A good design has an overview dashboard that shows the combined status of all servers on one page, and a detail dashboard per server for drilling down.

Blackbox Exporter: Monitoring URLs and External Endpoints

Blackbox Exporter probes endpoints from the outside (black-box monitoring). It supports several protocols:

HTTP/HTTPS: checks whether a website is reachable, how long it takes to respond, and when its SSL certificate expires.

TCP: checks whether a port is open, such as a database port or an application port.

ICMP (ping): checks whether a host is alive.

DNS: checks whether DNS resolution is correct and fast.

Blackbox Exporter is suited to monitoring the organization's websites, API endpoints, third-party services, and SLAs. For sysadmins it is valuable because it can alert as soon as a website goes down, an SSL certificate is close to expiry, or response time is higher than normal.

SNMP Exporter: Monitoring Network Devices

SNMP Exporter monitors network equipment that supports SNMP, such as switches, routers, firewalls, access points, UPS units, and printers. It works from an SNMP OID configuration (an snmp.yml file produced by the exporter's generator tool) and collects SNMP data from the target each time Prometheus scrapes it. Metrics obtained from network devices include interface traffic (in/out bytes), interface status (up/down), interface errors and discards, device CPU and memory, port status, and temperature. For sysadmins who look after many network devices, SNMP Exporter makes it possible to monitor everything from Prometheus and Grafana in one place, instead of logging in to the web management page of each device to check its status.

Pushgateway: For Batch Jobs and Cron

Pushgateway is designed for short-lived jobs that cannot expose a metrics endpoint for Prometheus to scrape, such as batch scripts that run and exit, cron jobs that run on a schedule, and backup scripts that need to report their status. The script pushes metrics to Pushgateway through its HTTP API, and Prometheus then scrapes Pushgateway. For example, a backup script can push backup status (success or failure), backup duration, and backup size, and you can alert when a backup fails or takes longer than usual.

Two cautions apply. Pushgateway is not designed to replace the normal pull model: if a job runs continuously, let Prometheus scrape it directly. Metrics pushed to Pushgateway also stay there until they are deleted explicitly, so stale metrics must be managed.
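The push step of the backup example can be sketched in Python with only the standard library. The gateway address, job name, and metric names below are hypothetical; the URL layout /metrics/job/<job>/instance/<instance> and the PUT and DELETE methods follow the Pushgateway HTTP API. The requests are only built here, not sent.

```python
import urllib.parse
import urllib.request


def _group_url(gateway_url, job, instance=None):
    """URL of one metric group on the Pushgateway."""
    url = gateway_url.rstrip("/") + "/metrics/job/" + urllib.parse.quote(job, safe="")
    if instance is not None:
        url += "/instance/" + urllib.parse.quote(instance, safe="")
    return url


def build_push_request(gateway_url, job, metrics, instance=None):
    """Build a PUT that replaces all metrics of the group.

    metrics: dict of metric name -> numeric value. The body is text
    exposition format and must end with a newline.
    """
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    return urllib.request.Request(
        _group_url(gateway_url, job, instance),
        data=body.encode("utf-8"),
        method="PUT",
    )


def build_delete_request(gateway_url, job, instance=None):
    """Build a DELETE that removes the group, to clear stale metrics."""
    return urllib.request.Request(
        _group_url(gateway_url, job, instance), method="DELETE")
```

A backup script would send the PUT with urllib.request.urlopen as its last step, and a cleanup job could send the DELETE when a host is decommissioned.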

Long-term Storage: Keeping Metrics for the Long Run

The Prometheus TSDB is designed for short-term storage (15 days by default). If you need to keep metrics for longer, such as a year or more, there are several options:

Thanos is an open source project that adds long-term storage and a global view to Prometheus. It stores TSDB blocks in object storage such as S3, GCS, or MinIO, with automatic compaction and downsampling. It is among the most widely used options in production.

Cortex is multi-tenant long-term storage for Prometheus, suited to managed services that offer monitoring to multiple tenants.

VictoriaMetrics is a Prometheus-compatible time series database that accepts remote write directly from Prometheus. It is designed for high performance and is often chosen for environments with high cardinality.

Mimir is long-term storage from Grafana Labs, designed to scale very high.

For sysadmins who are just starting out, begin with the Prometheus TSDB and set retention to 30-90 days. Consider Thanos or VictoriaMetrics only if you need to keep data longer than that.

Recording Rules: Pre-computing PromQL for Performance

Recording rules pre-compute PromQL expressions that are complex or frequently used and store the result as a new time series. The benefits are shorter query times when opening dashboards or evaluating alert rules, less load on the Prometheus server, and reusable aggregated metrics. For example, a complex CPU usage formula can be turned into a recording rule named instance:node_cpu_utilisation:rate5m, and that name can then be used in dashboards and alert rules in place of the full PromQL. Dashboards load much faster when many panels share the same formula.

Best Practices for Production

Sizing Prometheus Server

Sizing a Prometheus server depends on the number of time series it has to store. As rough rules of thumb:

RAM: about 1-2 GB per 100,000 active time series.

Disk: about 1-2 bytes per sample. For example, 100,000 series scraped every 15 seconds and kept for 15 days is about 8.64 billion samples, or roughly 9-17 GB, plus headroom for the write-ahead log and compaction.

CPU: depends on query load and the number of rules.

For a mid-sized environment (50-100 servers), a Prometheus server with 4 CPUs, 8-16 GB RAM, and a 200 GB SSD is sufficient.
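The sizing rules of thumb are simple enough to turn into a calculator. This Python sketch uses the article's figures (1-2 bytes per sample, 1-2 GB of RAM per 100,000 active series) as inputs; the function names are hypothetical, and the results are estimates that exclude the write-ahead log and other overhead.

```python
def estimate_disk_bytes(series, scrape_interval_s, retention_days,
                        bytes_per_sample=2.0):
    """Disk estimate: series x samples kept per series x bytes per sample."""
    samples_per_series = retention_days * 86400 / scrape_interval_s
    return series * samples_per_series * bytes_per_sample


def estimate_ram_gb(series, gb_per_100k=(1.0, 2.0)):
    """RAM estimate as a (low, high) range in GB."""
    factor = series / 100_000
    return (factor * gb_per_100k[0], factor * gb_per_100k[1])


if __name__ == "__main__":
    low = estimate_disk_bytes(100_000, 15, 15, bytes_per_sample=1.0)
    high = estimate_disk_bytes(100_000, 15, 15, bytes_per_sample=2.0)
    print(f"disk: {low / 1e9:.2f} to {high / 1e9:.2f} GB")
    print("ram (GB):", estimate_ram_gb(100_000))
```

Raising retention to 90 days with the same series count and interval multiplies the disk estimate by six, which is useful to check before changing --storage.tsdb.retention.time.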

High Availability

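Prometheus has no built-in clustering, but high availability can be achieved by running two Prometheus instances that scrape the same targets. Both instances then hold nearly identical data, differing only slightly because their scrapes happen at different moments. Run Alertmanager in cluster mode to deduplicate alerts, and put a load balancer between Grafana and the Prometheus instances so that queries go to whichever instance is available. This approach is simple and sufficient for SMBs that need a reasonable level of HA.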

Security Considerations

Prometheus does not enable authentication by default, so it has to be protected deliberately. Recent versions support TLS and basic authentication through a web configuration file (the --web.config.file flag), and a reverse proxy (Nginx, Traefik) in front of Prometheus is another common way to add authentication. Beyond that, restrict network access so only the monitoring network can reach Prometheus, use TLS for communication between Prometheus and its targets (exporters), do not expose Prometheus, Grafana, or Alertmanager directly to the Internet, and verify that exporters do not expose sensitive data in their metrics.

Conclusion: Getting Started with Server Monitoring on Prometheus Is Easier Than You Think

Prometheus together with Node Exporter, Alertmanager, and Grafana is a powerful stack for server monitoring that any organization can adopt without paying license fees. What you get is real-time monitoring that shows the status of every server at a glance, proactive alerting that warns before problems occur, historical data that helps with troubleshooting and capacity planning, and dashboards that are clear and customizable.

For sysadmins who are starting out, the recommended steps are: install the Prometheus server (Docker or bare metal), install Node Exporter on the most important Linux servers first, install Grafana and import the Node Exporter Full dashboard, create alert rules for disk full, instance down, and high CPU, and configure Alertmanager to send email or LINE notifications. From there, expand gradually to Windows Exporter, Blackbox Exporter, and SNMP Exporter as needed.

In 2026, with downtime growing ever more expensive, investing a day to set up a monitoring system will save a great deal of time and money in the long run.

