Network Backup and Recovery: Configuration Backup, ISSU, Graceful Restart, NSF, and Disaster Recovery
Network backup and recovery is an essential part of network operations that is often overlooked. Configuration backup keeps configs on hand to restore when something goes wrong. ISSU (In-Service Software Upgrade) upgrades the OS without downtime. Graceful Restart lets routing protocols ride through a restart without neighbors withdrawing routes. NSF (Non-Stop Forwarding) keeps the data plane forwarding even while the control plane restarts. Disaster Recovery is the plan for handling catastrophic failures.
Network downtime is very expensive: commonly cited estimates for Fortune 500 companies put the cost at $5,600-9,000 per minute of downtime. The main causes of outages, by rough share, are human error (40%), hardware failure (25%), software bugs (20%), and security incidents (15%). Configuration backups combined with tested recovery procedures can cut MTTR (Mean Time To Recovery) from several hours to minutes.
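To make the MTTR argument concrete, here is a rough back-of-the-envelope calculation. The per-minute figure is the low end of the range quoted above; the 4-hour and 15-minute recovery times are hypothetical values chosen only for illustration.

```python
def downtime_cost(minutes: float, cost_per_minute: float) -> float:
    """Return the estimated cost of an outage of the given length."""
    return minutes * cost_per_minute


COST_PER_MINUTE = 5_600  # low end of the quoted range (USD per minute)

slow_recovery = downtime_cost(4 * 60, COST_PER_MINUTE)  # hypothetical 4-hour MTTR
fast_recovery = downtime_cost(15, COST_PER_MINUTE)      # hypothetical 15-minute MTTR
savings = slow_recovery - fast_recovery

print(f"4 h outage:         ${slow_recovery:,.0f}")
print(f"15 min outage:      ${fast_recovery:,.0f}")
print(f"Saved per incident: ${savings:,.0f}")
```

Under these assumptions a single incident costs about $1.34M with a slow recovery and $84K with a fast one.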
Configuration Backup Methods
Method | How | Automation
TFTP/SCP/SFTP | copy running-config tftp://[server]/[file], run manually or on a schedule | EEM script, cron job on server
RANCID | Open source: logs in to devices, collects configs, stores them in version control (CVS/Git) | Scheduled (cron), diff reports via email
Oxidized | Modern RANCID replacement: Ruby-based, Git backend, REST API, web UI | Scheduled, event-driven, Git history
Ansible | ansible.netcommon.cli_command → backup config → store in Git | Playbook + cron/CI pipeline
Cisco DNA Center | Auto-backup on config change → history + rollback capability | Automatic (event-driven)
SolarWinds NCM | Network Configuration Manager: scheduled backup, compliance, change detection | Scheduled + real-time change alerts
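Whatever tool fetches the config, the storage side usually follows the same pattern: keep a timestamped copy and only write a new one when the content has changed. The following is a minimal sketch of that pattern using only the Python standard library. The function name, directory layout, and `.cfg` suffix are assumptions for illustration, and fetching the config from the device is out of scope here.

```python
import hashlib
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def save_backup(backup_dir: Path, hostname: str, config_text: str) -> Optional[Path]:
    """Store config_text as a new timestamped file unless it matches the latest backup.

    Returns the new file path, or None when the config is unchanged.
    """
    device_dir = backup_dir / hostname
    device_dir.mkdir(parents=True, exist_ok=True)
    data = config_text.encode()
    new_digest = hashlib.sha256(data).hexdigest()

    # File names are UTC timestamps, so lexical order is chronological order.
    existing = sorted(device_dir.glob("*.cfg"))
    if existing:
        last_digest = hashlib.sha256(existing[-1].read_bytes()).hexdigest()
        if last_digest == new_digest:
            return None

    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    path = device_dir / f"{stamp}.cfg"
    path.write_bytes(data)
    return path


# Demo in a throwaway directory; "edge-rtr-1" is a made-up hostname.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first = save_backup(root, "edge-rtr-1", "hostname edge-rtr-1\n")
    repeat = save_backup(root, "edge-rtr-1", "hostname edge-rtr-1\n")
    second = save_backup(root, "edge-rtr-1", "hostname edge-rtr-1\nntp server 192.0.2.1\n")
    stored = len(list((root / "edge-rtr-1").glob("*.cfg")))
```

In practice the backup directory would typically be a Git working tree, so that each new file also becomes a commit with history and diffs.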
Configuration Management Best Practices
Practice | Detail
Version Control | Store configs in Git → track every change (who, when, diff) → roll back to any version
Scheduled Backups | Daily automatic backup → verify backup integrity weekly
Change Detection | Alert on config change → detect unauthorized changes (compliance)
Golden Config | Template-based config → compare running vs golden → flag deviations
Pre-Change Backup | Always back up BEFORE making changes → rollback point if the change fails
Test Restore | Regularly test whether you can actually restore a device from backup (many teams never do)
Off-Site Copy | Store a backup copy off-site → survive a DC disaster
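The golden-config practice above comes down to a diff between two texts. Here is a minimal sketch using Python's standard `difflib`. The function name and the sample configs are made up for illustration, and a real compliance check would usually normalize timestamps and other volatile lines first.

```python
import difflib
from typing import List


def config_deviations(golden: str, running: str) -> List[str]:
    """Return the lines that differ: '-' is missing from running, '+' is extra in running."""
    diff = list(difflib.unified_diff(
        golden.splitlines(), running.splitlines(),
        fromfile="golden", tofile="running", lineterm="", n=0,
    ))
    # Skip the two file-header lines and the "@@" hunk markers.
    return [line for line in diff[2:] if not line.startswith("@@")]


golden = "hostname core-sw1\nntp server 192.0.2.1\nno ip http server\n"
running = "hostname core-sw1\nntp server 192.0.2.99\nno ip http server\n"

deviations = config_deviations(golden, running)
for line in deviations:
    print(line)
```

An empty result means the running config matches the golden one; anything else is a deviation to review.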
ISSU (In-Service Software Upgrade)
Feature | Details
What it is | Upgrade device software without traffic interruption: zero or minimal downtime
Requirement | Dual supervisor/RP (Route Processor): one runs the old image, one loads the new → switchover
Process | Load new image on standby RP → switch over to standby → old RP loads new image → sync
SSO | Stateful Switchover: standby RP syncs state from active → seamless switchover
Platforms | Cisco Catalyst 9K (StackWise Virtual), Nexus (dual sup), Juniper (dual RE)
Limitation | Not all upgrades are ISSU-compatible → check release notes (major version changes may require a reload)
Validation | show issu state, show redundancy → verify ISSU readiness before the upgrade
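The requirements and limitations above can be read as a pre-upgrade checklist. The sketch below encodes that checklist in Python. The `RedundancyFacts` fields are hypothetical: they stand for what an operator reads off the validation commands and the release notes, and nothing here parses real device output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RedundancyFacts:
    """Hypothetical summary of pre-ISSU checks, filled in by the operator."""
    standby_present: bool        # is a second supervisor/RP installed?
    standby_synced: bool         # has the standby finished syncing state (SSO ready)?
    image_issu_compatible: bool  # per the release notes for this upgrade path


def issu_blockers(facts: RedundancyFacts) -> List[str]:
    """Return the reasons ISSU should not start; an empty list means no blocker was found."""
    blockers = []
    if not facts.standby_present:
        blockers.append("no standby supervisor/RP")
    elif not facts.standby_synced:
        blockers.append("standby has not finished syncing state")
    if not facts.image_issu_compatible:
        blockers.append("target image is not ISSU-compatible with the running one")
    return blockers


ready = issu_blockers(RedundancyFacts(True, True, True))
single_sup = issu_blockers(RedundancyFacts(False, False, True))
print(ready)
print(single_sup)
```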
Graceful Restart (GR)
Protocol | How GR Works | Timer
BGP | GR capability is negotiated in the OPEN message → when the session drops, the helper keeps the peer's routes and marks them stale → the restarting router re-establishes the session, resends its routes, and signals completion with End-of-RIB → helper removes any routes still stale | Restart time: 120s, stale-path time: 360s (Cisco IOS defaults)
OSPF | Restarting router sends Grace-LSA → helpers don’t flush routes → router re-syncs after restart | Grace period: 120s (typical default)
IS-IS | Similar to OSPF: restarting router sends a Restart TLV → neighbors retain the adjacency | T3 timer for restart
EIGRP | NSF-aware: neighbor retains routes → restarting router re-syncs topology | Signal timer from restarting router
Helper Mode | Neighbors must be GR-aware (helper) → retain routes for the restarting peer during the restart window | Helper retains routes until the restart completes or the timer expires
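The helper-side behavior for BGP can be sketched as a small state model: mark the peer's routes stale when the session drops, then flush whatever is still stale either when End-of-RIB arrives or when the restart timer expires. This is a simplified illustration, not a protocol implementation. The class and method names are made up, time is passed in explicitly, and the separate stale-path timer is not modeled.

```python
from typing import Dict, Optional


class GrHelper:
    """Helper-side view of one BGP peer's routes during graceful restart (simplified)."""

    def __init__(self, restart_time: float = 120.0) -> None:
        self.restart_time = restart_time
        self.routes: Dict[str, bool] = {}      # prefix -> is_stale
        self.down_at: Optional[float] = None   # when the session dropped, if it is down

    def learn(self, prefix: str) -> None:
        """Install or refresh a route; a refreshed route is no longer stale."""
        self.routes[prefix] = False

    def session_down(self, now: float) -> None:
        """Peer restarted: keep its routes but mark them all stale."""
        self.down_at = now
        for prefix in self.routes:
            self.routes[prefix] = True

    def session_up(self) -> None:
        """Session re-established: stop the restart timer, keep stale routes for now."""
        self.down_at = None

    def end_of_rib(self) -> None:
        """Peer finished resending its routes: drop whatever is still stale."""
        self._flush_stale()

    def tick(self, now: float) -> None:
        """Flush stale routes if the peer did not come back within restart_time."""
        if self.down_at is not None and now - self.down_at >= self.restart_time:
            self._flush_stale()
            self.down_at = None

    def _flush_stale(self) -> None:
        self.routes = {p: stale for p, stale in self.routes.items() if not stale}


# Peer comes back in time but only re-advertises one of its two prefixes.
helper = GrHelper()
helper.learn("10.0.0.0/24")
helper.learn("10.0.1.0/24")
helper.session_down(now=0)
helper.tick(now=60)
kept_during_restart = sorted(helper.routes)
helper.session_up()
helper.learn("10.0.0.0/24")
helper.end_of_rib()
after_eor = sorted(helper.routes)

# Peer never comes back: routes are flushed when the restart timer expires.
slow = GrHelper()
slow.learn("10.0.0.0/24")
slow.session_down(now=0)
slow.tick(now=120)
```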
NSF (Non-Stop Forwarding)
Feature | Details
What it is | Data plane continues forwarding while the control plane restarts → little or no packet loss during RP switchover
How | FIB (Forwarding Information Base) retained in hardware → packets forwarded using the stale FIB → control plane rebuilds
With SSO | SSO + NSF: standby RP takes over (SSO) + forwarding continues (NSF) → near-zero downtime
With GR | NSF + GR: forwarding continues (NSF) + routing neighbors don’t tear down the adjacency (GR)
Requirement | Dual RP/supervisor + NSF-capable platform + GR-capable neighbors
Limitation | Routes are stale during the restart → if the topology changes in that window, traffic can be black-holed
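The black-hole limitation is easier to see with a toy forwarding table. The sketch below does a longest-prefix match against a FIB that stays frozen while the control plane restarts. All prefixes and next-hop addresses are made-up documentation addresses, and the "live next hops" set is a stand-in for real link state.

```python
import ipaddress
from typing import Dict, Optional, Set


def forward(fib: Dict[str, str], live_next_hops: Set[str], dst: str) -> Optional[str]:
    """Longest-prefix match in the FIB; return the next hop, or None if the packet is lost."""
    addr = ipaddress.ip_address(dst)
    matches = [
        (ipaddress.ip_network(prefix), next_hop)
        for prefix, next_hop in fib.items()
        if addr in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    _, next_hop = max(matches, key=lambda m: m[0].prefixlen)
    # The FIB is not consulted again: a dead next hop means the packet is dropped.
    return next_hop if next_hop in live_next_hops else None


fib = {"10.1.0.0/16": "192.0.2.1", "0.0.0.0/0": "192.0.2.254"}

# Before the restart: both next hops are reachable.
before = forward(fib, {"192.0.2.1", "192.0.2.254"}, "10.1.5.9")

# During the restart 192.0.2.1 fails, but the stale FIB still points at it.
during = forward(fib, {"192.0.2.254"}, "10.1.5.9")

# After the control plane rebuilds, the FIB is updated and traffic flows again.
rebuilt = dict(fib)
rebuilt["10.1.0.0/16"] = "192.0.2.254"
after = forward(rebuilt, {"192.0.2.254"}, "10.1.5.9")
```

Here `during` is `None`: the packet is black-holed even though another usable path exists, because nothing can reprogram the FIB until the control plane is back.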
Disaster Recovery
Tier | Recovery | RPO/RTO | Cost
Cold Site | Empty facility + configs backed up → procure and configure equipment | RPO: hours-days, RTO: days-weeks | Low
Warm Site | Pre-configured equipment → restore configs from backup → activate | RPO: hours, RTO: hours | Medium
Hot Site | Active-standby DC → all configs synced → failover when primary fails | RPO: minutes, RTO: minutes | High
Active-Active | Both DCs active → load balanced → one fails → other absorbs traffic | RPO: ~0, RTO: seconds | Very High
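Since cost rises with each tier, the usual question is which is the cheapest tier that still meets the required RTO. The sketch below shows one way to express that choice. The numeric cut-offs (one minute, one hour, one day) are assumptions that turn the table's "seconds / minutes / hours / days" into numbers; real thresholds depend on the business.

```python
def cheapest_tier(required_rto_seconds: float) -> str:
    """Pick the lowest-cost DR tier whose typical RTO fits the requirement.

    The cut-offs are illustrative assumptions, not standard values.
    """
    if required_rto_seconds < 60:
        return "Active-Active"  # RTO in seconds
    if required_rto_seconds < 3600:
        return "Hot Site"       # RTO in minutes
    if required_rto_seconds < 86400:
        return "Warm Site"      # RTO in hours
    return "Cold Site"          # RTO in days to weeks


for rto in (30, 10 * 60, 4 * 3600, 3 * 86400):
    print(f"RTO {rto:>7} s -> {cheapest_tier(rto)}")
```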
Closing Thoughts: Backup = Insurance, Recovery = Practice
Network Backup and Recovery at a glance:
Config Backup: TFTP/SCP, Oxidized/RANCID (Git), Ansible, DNA Center; daily auto-backup + version control
Best Practices: Git versioning, change detection, golden config compliance, pre-change backup, regular restore tests
ISSU: zero-downtime upgrade (dual RP + SSO); load new image on standby → switchover → sync
GR: routing protocols (BGP, OSPF, IS-IS) retain routes during restart; requires helper mode on neighbors
NSF: data plane forwards using the stale FIB during control plane restart; combine with SSO + GR
DR: cold (days), warm (hours), hot (minutes), active-active (seconds); a cost vs recovery time tradeoff
Key: backups are worthless if never tested; schedule quarterly DR drills and verify restore procedures