What Is an IT Disaster Recovery Plan (DRP)? A Complete Guide to Planning IT System Recovery (2026)

April 8, 2026

Introduction: When IT Systems Go Down, How Prepared Are You?

What happens if you arrive at the office on Monday morning and find the server room flooded, the storage RAID array failed in its entirety, ransomware has encrypted every file on the file server, or your cloud provider is in the middle of a major outage lasting several hours? Employees cannot work, customers cannot reach you, orders are lost, data is gone, and every second that passes costs money.

These scenarios are not hypothetical; they happen to organizations in Thailand every year. The great flood of 2011 (B.E. 2554) caused many companies to lose data permanently because their backups sat in the same data center as production and no offsite copy existed. Ransomware attacks, which increase every year, have forced even hospitals to suspend services because their entire IT environment was down.

An IT Disaster Recovery Plan (DRP) is the plan that lets an organization restore its IT systems quickly and methodically when the unexpected happens. This article covers DR planning end to end: the difference between DRP and BCP, calculating RTO/RPO, risk analysis, the tiers of DR strategy, testing the DR plan, building runbooks, and DR automation in the cloud era.

Part 1: DRP vs BCP - The Difference You Need to Understand

1.1 What Is a BCP?

A Business Continuity Plan (BCP) covers every aspect of the business, not only IT. It includes staffing, alternate work locations, communications, the supply chain, and business processes for situations in which the business cannot operate normally.

1.2 What Is a DRP?

A Disaster Recovery Plan (DRP) is the part of the BCP that focuses specifically on restoring IT systems (servers, databases, applications, networks, storage) so that IT is back in service within a defined time.

BCP vs DRP Comparison:

┌────────────────────────────────────────────┐
│              BCP (Business Continuity)      │
│                                            │
│  ┌──────────────────────────────────────┐  │
│  │         DRP (Disaster Recovery)       │  │
│  │                                      │  │
│  │  ├── Server recovery                │  │
│  │  ├── Database restore               │  │
│  │  ├── Network failover               │  │
│  │  ├── Application recovery           │  │
│  │  └── Data restoration               │  │
│  └──────────────────────────────────────┘  │
│                                            │
│  ├── People & staffing plan                │
│  ├── Alternate work locations              │
│  ├── Communication plan                    │
│  ├── Supply chain continuity               │
│  ├── Legal & regulatory compliance         │
│  ├── Financial continuity                  │
│  ├── Customer communication                │
│  └── Crisis management                     │
└────────────────────────────────────────────┘

BCP:
├── Scope: every aspect of the business
├── Owner: CEO / Board / C-Level
├── Focus: keeping the business running
├── Covers: People, Process, Technology
└── Standard: ISO 22301

DRP:
├── Scope: IT systems only
├── Owner: CTO / IT Director / IT Manager
├── Focus: restoring IT systems to working order
├── Covers: Technology (servers, network, data)
└── Standards: ISO/IEC 27031, NIST SP 800-34

Part 2: RTO and RPO - The Core DR Metrics

2.1 RTO (Recovery Time Objective)

RTO is the maximum acceptable time an IT system may be down, measured from the moment of the incident until the system is usable again. The lower the RTO, the larger the investment, because standby systems must be ready to take over sooner.

2.2 RPO (Recovery Point Objective)

RPO is the maximum amount of data the organization can afford to lose, expressed as a span of time. An RPO of 1 hour means losing at most the last hour of data is acceptable, so data must be backed up or replicated at least once every hour.

RTO/RPO Visual Timeline:

               RPO                       RTO
         ◀────────────▶◀──────────────────────────▶

         │              │                          │
─────────┼──────────────┼──────────────────────────┼──────▶ time
         │              │                          │
    Last backup     Disaster                 System back
                    occurs                   in service

    Data written between the last backup
    and the disaster is lost.
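
The timeline can also be read as two subtractions: the data-loss window is the disaster time minus the last backup time, and the downtime is the restore time minus the disaster time. The following is a minimal sketch of that arithmetic; the function names and the sample incident timestamps are hypothetical and only illustrate how achieved values compare against RPO/RTO targets.

```python
from datetime import datetime, timedelta


def measure_recovery(last_backup, disaster, restored):
    """Return (data_loss_window, downtime) for one incident."""
    if not (last_backup <= disaster <= restored):
        raise ValueError("expected last_backup <= disaster <= restored")
    return disaster - last_backup, restored - disaster


def meets_objectives(data_loss, downtime, rpo, rto):
    """True when both the RPO and the RTO target were met."""
    return data_loss <= rpo and downtime <= rto


# Hypothetical incident used only for illustration.
last_backup = datetime(2026, 4, 6, 8, 0)
disaster = datetime(2026, 4, 6, 8, 45)
restored = datetime(2026, 4, 6, 11, 30)

data_loss, downtime = measure_recovery(last_backup, disaster, restored)
print("data loss window:", data_loss)
print("downtime:", downtime)
print("targets met:", meets_objectives(
    data_loss, downtime, rpo=timedelta(hours=1), rto=timedelta(hours=4)))
```

The same two numbers are what a DR test report records as "RPO Achieved" and "RTO Achieved".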

Example RTO/RPO by system type:

┌────────────────────┬────────────┬────────────┬────────────────┐
│ System             │ RTO        │ RPO        │ DR Strategy    │
├────────────────────┼────────────┼────────────┼────────────────┤
│ E-commerce         │ 15 minutes │ 0 (zero)   │ Active-Active  │
│ (direct revenue)   │            │            │                │
├────────────────────┼────────────┼────────────┼────────────────┤
│ Core Banking       │ 1 hour     │ 0 (zero)   │ Hot Standby    │
├────────────────────┼────────────┼────────────┼────────────────┤
│ ERP/SAP            │ 4 hours    │ 1 hour     │ Warm Standby   │
├────────────────────┼────────────┼────────────┼────────────────┤
│ Email/Office 365   │ 8 hours    │ 4 hours    │ Pilot Light    │
├────────────────────┼────────────┼────────────┼────────────────┤
│ HR/Payroll         │ 24 hours   │ 24 hours   │ Backup/Restore │
│ (used monthly)     │            │            │                │
├────────────────────┼────────────┼────────────┼────────────────┤
│ Archive/old data   │ 72 hours   │ 1 week     │ Backup/Restore │
└────────────────────┴────────────┴────────────┴────────────────┘

Rule: the lower the RTO/RPO, the higher the cost
├── RTO 0-5 min = Active-Active (very expensive)
├── RTO 15-60 min = Hot Standby (expensive)
├── RTO 1-4 h = Warm Standby (moderate)
├── RTO 4-8 h = Pilot Light (economical)
└── RTO 24-72 h = Backup/Restore (cheapest)

Part 3: Risk Assessment and BIA

3.1 Business Impact Analysis (BIA)

A Business Impact Analysis (BIA) examines how the business would be affected if each IT system went down, in terms of revenue, reputation, legal exposure, and operations. The BIA is the first step before any DR planning, because it determines which systems are most critical and what RTO/RPO each one needs:

BIA Template:

For each IT system, assess the following:

1. Identify Business Functions:
   ├── Which business function does this system support?
   ├── Who uses it? How many people?
   ├── When is it used? (24/7? business hours?)
   └── Is there a peak period? (e.g. month end, holidays)

2. Impact Assessment (when the system is down):
   ├── Financial Impact:
   │   ├── Revenue lost per hour: _____ THB
   │   ├── Fines/penalties: _____ THB
   │   ├── Cost of remediation: _____ THB
   │   └── Opportunity cost: _____ THB
   ├── Operational Impact:
   │   ├── How many employees cannot work?
   │   ├── Which processes stall?
   │   └── Is a manual workaround possible?
   ├── Reputational Impact:
   │   ├── Are customers affected?
   │   ├── Is there media coverage?
   │   └── Level: Low / Medium / High / Critical
   └── Legal/Regulatory Impact:
       ├── Is PDPA violated? (personal data leaked)
       ├── Is compliance violated? (PCI DSS, ISO 27001)
       └── Must a regulator be notified?

3. Assign Priority:
   ├── Tier 1 (Mission Critical): recover within 1 hour
   ├── Tier 2 (Business Critical): recover within 4 hours
   ├── Tier 3 (Important): recover within 24 hours
   └── Tier 4 (Non-Critical): recover within 72 hours

4. Dependencies:
   ├── What does this system depend on? (DB, network, DNS, etc.)
   ├── Which other systems depend on it?
   └── Recovery order: recover A before B, B before C
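
The "recover A before B, B before C" rule in the Dependencies step is a topological ordering of the dependency map. Below is a minimal sketch using Python's standard `graphlib` module; the system names and the `depends_on` map are hypothetical examples, not part of any real inventory.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: system -> systems it depends on.
depends_on = {
    "dns": [],
    "database": ["dns"],
    "app": ["database", "dns"],
    "web": ["app"],
}


def recovery_order(depends_on):
    """Order systems so that every dependency is recovered first.

    Raises graphlib.CycleError if the map contains a circular dependency.
    """
    return list(TopologicalSorter(depends_on).static_order())


order = recovery_order(depends_on)
print(" -> ".join(order))
```

A circular dependency raises an error instead of producing an order, which is a useful finding in itself: it means two systems cannot be recovered independently and the runbook has to treat them as one unit.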

3.2 Disaster Types

Types of Disasters:

1. Natural Disasters:
   ├── Flooding: common in Thailand, especially Bangkok and its vicinity
   ├── Storms / high winds: downed power lines, toppled signal towers
   ├── Earthquakes: northern region
   ├── Fire: from electrical short circuits or overheating equipment
   └── Lightning: destroys UPS units, switches, routers

2. Cyber Attacks:
   ├── Ransomware: encrypts files and demands a ransom
   │   ├── Several Thai organizations were attacked in 2025
   │   └── Average downtime: 21 days
   ├── DDoS: takes websites/services offline
   ├── Data breach: leaked data (a PDPA violation)
   ├── Supply chain attack: a vendor is compromised and the impact reaches us
   └── Insider threat: an employee destroys data, deliberately or by accident

3. Hardware Failure:
   ├── Hard drive / SSD failure
   ├── RAID controller failure
   ├── Power supply failure
   ├── Network equipment failure (switch, router)
   ├── Air conditioning failure → overheating → cascade failure
   └── UPS battery failure

4. Human Error:
   ├── Deleting the wrong data (rm -rf /, DROP TABLE)
   ├── Misconfiguration (firewall rule, routing)
   ├── Deploying buggy code to production
   ├── Unplugging a cable / bumping equipment
   └── Forgetting to renew a domain, certificate, or license

5. Infrastructure Failure:
   ├── Extended power outage (UPS depleted, generator fails to start)
   ├── Internet outage (ISP down)
   ├── Cloud provider outage (AWS, Azure region down)
   ├── Cooling system failure
   └── Building inaccessible (fire, emergency situation)

Part 4: DR Strategies

4.1 DR Strategy Tiers

DR strategies come in several tiers, from the cheapest (Backup/Restore) to the most expensive (Active-Active). The right choice depends on the required RTO/RPO and the available budget:

DR Strategy Tiers (cheapest to most expensive):

Tier 1: Backup & Restore
├── Method: back data up to an offsite location; after an incident, restore from backup
├── RTO: 24-72 hours (depends on data size and bandwidth)
├── RPO: 24 hours (with one backup per day)
├── Cost: ★☆☆☆☆ (lowest)
├── Best for: non-critical systems, SMBs
├── Pros: simple, cheap, easy to understand
├── Cons: high RTO; new hardware must be sourced if the DC is destroyed
└── Example: Veeam backup → offsite NAS or cloud storage

Tier 2: Pilot Light
├── Method: core infrastructure is in place at the DR site, but compute is shut down
│   ├── Database replicated (async)
│   ├── AMI/VM templates ready
│   └── DNS and network config ready
├── On incident → start VMs, scale up, switch DNS
├── RTO: 4-8 hours
├── RPO: 1-4 hours
├── Cost: ★★☆☆☆ (low to moderate)
├── Best for: systems without a strict RTO
├── Pros: saves compute cost (VMs stay powered off)
└── Cons: systems must be started and configured during the incident

Tier 3: Warm Standby
├── Method: the DR site runs the system, but not at full capacity
│   ├── Database replicated (sync or near-sync)
│   ├── Application servers running (at smaller scale)
│   └── Ready to scale up on demand
├── On incident → scale up + switch traffic
├── RTO: 1-4 hours
├── RPO: minutes to 1 hour
├── Cost: ★★★☆☆ (moderate)
├── Best for: business-critical systems
├── Pros: much faster RTO than Pilot Light
└── Cons: higher running cost (VMs must stay running)

Tier 4: Hot Standby
├── Method: the DR site runs at full capacity, mirroring production
│   ├── Database: synchronous replication
│   ├── Application: fully configured & running
│   ├── Network: ready to receive traffic
│   └── Data: near-zero data loss
├── On incident → switch DNS/load balancer (manual failover)
├── RTO: 15-60 minutes
├── RPO: 0-15 minutes
├── Cost: ★★★★☆ (high)
├── Best for: mission-critical systems
├── Pros: very low RTO, minimal data loss
└── Cons: cost is nearly 2x (running a full duplicate)

Tier 5: Active-Active (Multi-Site)
├── Method: both sites (or more) serve traffic at the same time
│   ├── Load balancer distributes traffic across both sites
│   ├── Database: multi-master replication
│   ├── Application: running at full capacity at both sites
│   └── If one site fails → the other takes all traffic (auto)
├── On incident → automatic failover (possibly with no manual step at all)
├── RTO: 0-5 minutes (near-zero)
├── RPO: 0 (zero data loss, provided replication between sites is synchronous)
├── Cost: ★★★★★ (very high)
├── Best for: financial services, large e-commerce
├── Pros: near-zero downtime, zero data loss
└── Cons: very expensive, complex to manage (especially the database)
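
One way to apply the tier list is to pick the cheapest strategy whose worst-case RTO and RPO still fit the targets from the BIA. The sketch below encodes the upper bound of each range quoted above; the function name and the conservative "worst-case" reading of the ranges are assumptions made for illustration, not a vendor rule.

```python
from datetime import timedelta as td

# (name, worst-case RTO, worst-case RPO), cheapest first.
# Figures are the upper bounds of the ranges in the tier list above.
STRATEGIES = [
    ("Backup & Restore", td(hours=72), td(hours=24)),
    ("Pilot Light", td(hours=8), td(hours=4)),
    ("Warm Standby", td(hours=4), td(hours=1)),
    ("Hot Standby", td(minutes=60), td(minutes=15)),
    ("Active-Active", td(minutes=5), td(0)),
]


def cheapest_strategy(rto, rpo):
    """Return the cheapest strategy that meets both targets, or None."""
    for name, worst_rto, worst_rpo in STRATEGIES:
        if worst_rto <= rto and worst_rpo <= rpo:
            return name
    return None


print(cheapest_strategy(td(hours=4), td(hours=1)))
print(cheapest_strategy(td(minutes=15), td(0)))
```

Because the check uses worst-case values, it can recommend a more expensive tier than a best-case reading would; that is a deliberate bias toward meeting the target rather than toward cost.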

4.2 DR Strategies Visual Comparison

DR Strategy Cost vs RTO:

Cost
    ▲
    │
    │    ★ Active-Active
    │   ╱
    │  ╱  ★ Hot Standby
    │ ╱  ╱
    │╱  ╱   ★ Warm Standby
    │  ╱   ╱
    │ ╱   ╱    ★ Pilot Light
    │╱   ╱    ╱
    │   ╱    ╱     ★ Backup/Restore
    │──╱────╱─────╱───────────────▶ RTO
    0  15min 1hr  4hr  24hr 72hr

Part 5: DR for On-Premises

5.1 Replication Technologies

On-Premise DR Technologies:

1. Storage Replication:
   ├── Synchronous Replication:
   │   ├── Data is written to both sites at the same time
   │   ├── RPO = 0 (zero data loss)
   │   ├── Limit: distance ≤ 100 km (latency sensitive)
   │   ├── Examples: NetApp MetroCluster, Dell EMC SRDF/S
   │   └── Requires a dedicated fiber link between sites
   │
   └── Asynchronous Replication:
       ├── Data is replicated to the DR site with a delay
       ├── RPO = seconds to minutes (depends on the replication interval)
       ├── No distance limit
       ├── Examples: NetApp SnapMirror, Dell EMC SRDF/A
       └── Uses less bandwidth than sync

2. Database Replication:
   ├── SQL Server Always On (Availability Groups):
   │   ├── Synchronous commit: RPO = 0
   │   ├── Asynchronous commit: RPO = seconds
   │   ├── Automatic failover (sync mode)
   │   └── Recommendation: sync for local DR, async for remote DR
   │
   ├── MySQL/MariaDB Replication:
   │   ├── Master-Slave replication
   │   ├── Group Replication (multi-master)
   │   ├── Galera Cluster (sync multi-master)
   │   └── MySQL InnoDB Cluster
   │
   ├── PostgreSQL:
   │   ├── Streaming Replication (async/sync)
   │   ├── Logical Replication
   │   └── Patroni (HA + auto-failover)
   │
   └── Oracle Data Guard:
       ├── Physical Standby (block-level replication)
       ├── Logical Standby (SQL apply)
       ├── Active Data Guard (read-only queries on the standby)
       └── Far Sync (zero data loss at any distance)

3. VM Replication:
   ├── VMware vSphere Replication:
   │   ├── RPO: 5 minutes to 24 hours
   │   ├── Operates at the VM level
   │   ├── No shared storage required
   │   └── Used together with Site Recovery Manager (SRM)
   │
   ├── Veeam Backup & Replication:
   │   ├── Both backup and replication
   │   ├── CDP (Continuous Data Protection): RPO in seconds
   │   ├── Instant Recovery: start a VM directly from the backup file
   │   └── SureBackup: automated recovery verification and DR testing
   │
   ├── Zerto:
   │   ├── Continuous replication — RPO = seconds
   │   ├── Journal-based recovery — point-in-time restore
   │   ├── Non-disruptive DR testing
   │   └── Multi-cloud support
   │
   └── Proxmox Backup Server:
       ├── Incremental backup (fast)
       ├── Deduplication + compression
       ├── Client-side encryption
       └── Web UI + REST API

5.2 Clustering for HA

Clustering Technologies:

1. Windows Server Failover Cluster (WSFC):
   ├── Active-Passive: 1 node active, 1 node standby
   ├── Active-Active: both nodes active (different workloads)
   ├── Typically requires shared storage (SAN, S2D)
   ├── Quorum: disk witness, file share witness, cloud witness
   └── Used for: SQL Server, File Server, Hyper-V

2. Linux HA (Pacemaker + Corosync):
   ├── Pacemaker: cluster resource manager
   ├── Corosync: cluster communication
   ├── DRBD: distributed replicated block device
   ├── Resource agents: control services (Apache, MySQL, etc.)
   └── Used for: web servers, databases, application servers

3. VMware vSphere HA:
   ├── Restarts VMs on another host when a host fails
   ├── Auto-detect host failure
   ├── Application monitoring (restart unresponsive VMs)
   ├── Admission control: guarantees sufficient spare capacity
   └── Not DR: this is HA within a single cluster

4. Stretched Cluster:
   ├── A cluster that spans two sites
   ├── VMware vSAN Stretched Cluster
   ├── Nutanix Metro Availability
   ├── Requires a witness node at a third site
   └── Limit: inter-site latency ≤ 5 ms RTT
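
The distance limit for synchronous replication and the 5 ms RTT limit for stretched clusters both come down to propagation delay. As a rough worked example, light in optical fiber travels at about 200,000 km/s (roughly two thirds of its speed in vacuum), which is 200 km per millisecond. The sketch below uses only that approximation; the function names are hypothetical, and real links add switching, routing, and non-straight cable paths on top of the figure computed here.

```python
# Approximate propagation speed in optical fiber: ~200,000 km/s = 200 km/ms.
FIBER_KM_PER_MS = 200.0


def fiber_rtt_ms(distance_km, overhead_ms=0.0):
    """Estimated round-trip time over a fiber path of the given length."""
    return 2 * distance_km / FIBER_KM_PER_MS + overhead_ms


def max_distance_km(rtt_budget_ms, overhead_ms=0.0):
    """Longest fiber path that fits in an RTT budget (propagation only)."""
    return (rtt_budget_ms - overhead_ms) * FIBER_KM_PER_MS / 2


print(fiber_rtt_ms(100))        # RTT added by a 100 km path
print(max_distance_km(5.0))     # theoretical ceiling for a 5 ms RTT budget
```

Every synchronous write waits for at least one round trip, so even one extra millisecond is paid on each commit; that is why vendors quote distance limits well below the theoretical ceiling.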

Part 6: DR in the Cloud

6.1 AWS Disaster Recovery

AWS DR Services & Strategies:

1. AWS Elastic Disaster Recovery (DRS):
   ├── Formerly CloudEndure Disaster Recovery
   ├── Continuous replication from on-premises to AWS
   ├── RPO: seconds (continuous replication)
   ├── RTO: minutes (launch recovery instances)
   ├── Works at the block level, so applications need no changes
   ├── Non-disruptive DR testing
   ├── Low cost: you pay only for the staging area (small instances)
   └── Best for: on-premises → AWS DR

2. AWS Cross-Region DR:
   ├── S3 Cross-Region Replication → data replication
   ├── RDS Cross-Region Read Replica → database DR
   ├── Aurora Global Database → multi-region, RPO < 1 sec
   ├── DynamoDB Global Tables → multi-region, auto-replication
   ├── EBS Snapshots → copy to another region
   ├── AMI copy → replicate VM images cross-region
   └── Route 53 health check → DNS failover

3. AWS Multi-AZ (HA within region):
   ├── RDS Multi-AZ: sync replication, auto-failover
   ├── EC2 Auto Scaling across AZs
   ├── ELB cross-zone load balancing
   ├── S3: auto-replicate across 3+ AZs
   └── Note: Multi-AZ ≠ DR (same region)

AWS DR Architecture Example:

Region: ap-southeast-1 (Singapore - Primary)
├── VPC → EC2, RDS, ElastiCache
├── S3 → application data
└── Route 53 → primary DNS

        ↕ Replication (async)

Region: ap-northeast-1 (Tokyo - DR)
├── VPC → standby EC2 (stopped), RDS read replica
├── S3 → replicated data
└── Route 53 → failover DNS (health check)
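
The health-check-driven DNS failover in this architecture can be reasoned about with two small pieces of logic: when to declare the primary unhealthy, and how long clients may keep using the old address afterwards. The sketch below is generic Python, not the Route 53 API; the function names, the threshold of 3 consecutive failures, and the interval and TTL values are all assumptions chosen for illustration.

```python
def should_fail_over(results, failure_threshold=3):
    """True when the most recent `failure_threshold` checks all failed.

    `results` is a list of booleans, oldest first (True = healthy).
    """
    recent = results[-failure_threshold:]
    return len(recent) == failure_threshold and not any(recent)


def worst_case_cutover_seconds(check_interval_s, failure_threshold, dns_ttl_s):
    """Rough upper bound: time to detect the failure plus DNS cache expiry."""
    return check_interval_s * failure_threshold + dns_ttl_s


history = [True, True, False, False, False]
print(should_fail_over(history))
print(worst_case_cutover_seconds(30, 3, 60))
```

The second function shows why the DNS TTL belongs in the RTO calculation: a long TTL can dominate the cutover time even when detection is fast.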

6.2 Azure Site Recovery (ASR)

Azure Site Recovery:

Overview:
├── DR-as-a-Service from Microsoft
├── Replicate VMs (VMware, Hyper-V, Physical, Azure-to-Azure)
├── Continuous replication → RPO = seconds
├── Automated recovery plans → RTO = minutes
├── Non-disruptive DR testing (test failover)
└── Pay only for: storage + replicated data + compute during DR

ASR Scenarios:

1. VMware → Azure:
   ├── Install: Azure Site Recovery appliance (on-premise)
   ├── Continuous replication to Azure (managed disks)
   ├── On incident → failover → VMs boot in Azure
   ├── When the primary is back → failback → replicate back
   └── Best for: on-premises VMware DR to Azure

2. Azure → Azure:
   ├── Replicate Azure VMs across regions
   ├── e.g. Southeast Asia → East Asia
   ├── Auto-replicate: VMs, managed disks, networking
   └── Best for: Azure-native applications

3. Hyper-V → Azure:
   ├── Hyper-V Replica → Azure
   ├── Uses the Azure Site Recovery Provider
   └── Best for: Windows Server / Hyper-V environments

ASR Recovery Plan:
├── Group 1: Start DNS + AD Domain Controllers
├── Group 2: Start Database servers
├── Group 3: Start Application servers
├── Group 4: Start Web servers
├── Custom scripts: update connection strings, DNS, etc.
└── Test failover → tests the plan without affecting production

Part 7: DR Testing

7.1 Why Test?

A DR plan that has never been tested is no DR plan at all. Reported figures suggest that more than 40% of organizations that have a DR plan but never tested it discover during a real incident that the plan does not work: backup restores fail, service account passwords have changed, the network configuration is wrong, or application dependencies are missing.

7.2 Types of DR Test

DR Testing Types (easiest to hardest):

1. Tabletop Exercise:
   ├── Format: a joint meeting that poses a scenario and discusses the response
   ├── Participants: IT team, management, stakeholders
   ├── Duration: 2-4 hours
   ├── Production impact: none
   ├── Cost: very low (staff time only)
   ├── Example: "Suppose the primary DC is flooded. What do we do?"
   ├── Evaluate: does everyone know their role? Is the plan complete? Where are the gaps?
   └── Frequency: every 3 months

2. Walkthrough Test:
   ├── Format: the IT team walks through the DRP one step at a time
   ├── Nothing is executed; the goal is to verify that the steps are correct and current
   ├── Check: do credentials still work? Is the contact list accurate?
   ├── Production impact: none
   ├── Cost: low
   └── Frequency: every 3 months

3. Simulation Test:
   ├── Format: simulate a real event without touching production
   ├── Example: restore a backup to a test server and verify it
   ├── Example: start VMs at the DR site in an isolated network
   ├── Production impact: minimal (uses a test environment)
   ├── Cost: moderate (compute for test VMs)
   ├── Check: does the backup actually restore? Does the application work?
   └── Frequency: every 6 months

4. Parallel Test:
   ├── Format: bring the DR site up for real and run tests in parallel with production
   ├── Production keeps running normally
   ├── Check: does the DR site behave like production?
   ├── Test: performance, data integrity, application functionality
   ├── Production impact: low (though bandwidth may be affected)
   ├── Cost: high (runs the full DR environment)
   └── Frequency: every 6-12 months

5. Full Interruption Test:
   ├── Format: actually shut down the production site → move everything to the DR site
   ├── The most realistic test
   ├── Production impact: very high (planned downtime)
   ├── Cost: very high
   ├── Risk: if DR does not work → extended outage
   ├── Must be done outside business hours (e.g. a long holiday)
   └── Frequency: once a year (if organizational maturity allows)

7.3 DR Test Report Template

DR Test Report:

Test Information:
├── Test Date: _____________
├── Test Type: Tabletop / Walkthrough / Simulation / Parallel / Full Interruption
├── Scenario: _____________
├── Duration: _____ hours
└── Participants: _____________

Results:
├── RPO Achieved: _____ (target: _____)
├── RTO Achieved: _____ (target: _____)
├── Data Integrity: Pass / Fail
├── Application Functionality: Pass / Fail
├── Network Connectivity: Pass / Fail
└── User Access: Pass / Fail

Issues Found:
├── Issue 1: _________ | Severity: High/Medium/Low
├── Issue 2: _________ | Severity: High/Medium/Low
└── Issue 3: _________ | Severity: High/Medium/Low

Action Items:
├── Action 1: _________ | Owner: _____ | Due: _____
├── Action 2: _________ | Owner: _____ | Due: _____
└── Action 3: _________ | Owner: _____ | Due: _____

Lessons Learned:
├── What went well: _____________
├── What didn't go well: _____________
└── What to improve: _____________

Sign-off:
├── IT Manager: _________ Date: _____
└── CTO/Director: _________ Date: _____

Part 8: DR Documentation

8.1 DR Plan Document Structure

DR Plan Document Template:

1. Executive Summary
   ├── Purpose of the DR plan
   ├── Scope (systems covered)
   ├── Summary of RTO/RPO targets
   └── Date of last update

2. Roles & Responsibilities
   ├── DR Coordinator: _____ (phone, email)
   ├── IT Infrastructure Lead: _____
   ├── Database Lead: _____
   ├── Application Lead: _____
   ├── Network Lead: _____
   ├── Communication Lead: _____
   ├── Vendor Contact List:
   │   ├── ISP: _____ (contract #, support #)
   │   ├── Hardware vendor: _____
   │   ├── Software vendor: _____
   │   ├── Cloud provider: _____
   │   └── DR site provider: _____
   └── Escalation Matrix:
       ├── Level 1: IT Team → 15 min response
       ├── Level 2: IT Manager → 30 min response
       ├── Level 3: CTO → 1 hour response
       └── Level 4: CEO → 2 hour response

3. Disaster Declaration Process
   ├── Who has authority to declare a disaster? (IT Manager + CTO)
   ├── Declaration criteria:
   │   ├── Production DC cannot be accessed
   │   ├── Multiple critical systems down > 1 hour
   │   ├── Data corruption detected
   │   └── Ransomware infection confirmed
   └── Declaration steps: assess → declare → activate DR plan

4. Communication Plan
   ├── Internal Communication:
   │   ├── IT Team: MS Teams / Line group
   │   ├── Management: email + phone call
   │   ├── All staff: email + intranet announcement
   │   └── Message template for each audience
   ├── External Communication:
   │   ├── Customers: email + website banner
   │   ├── Partners/Vendors: email + phone
   │   ├── Media: handled by the PR team (if needed)
   │   └── Regulators: as required (PDPA, BOT, etc.)
   └── Status Update: every hour until resolved

5. System Inventory
   ├── List of all systems + priority tier
   ├── Dependencies map
   ├── Recovery order
   └── Owner of each system

6. Recovery Procedures (Runbooks)
   ├── A separate runbook for each system
   ├── Step-by-step procedures
   ├── Screenshots / commands
   └── Verification steps

7. DR Site Information
   ├── Location, access procedure
   ├── Network diagram
   ├── Hardware inventory
   ├── Capacity
   └── Contact information

8. Test Schedule & Results
   ├── Annual test calendar
   ├── Past test results
   └── Open action items

9. Maintenance Schedule
   ├── Document review: every 3 months
   ├── Contact list update: every month
   ├── DR test: per schedule
   └── Plan update triggers:
       ├── Major infrastructure change
       ├── New critical system deployment
       ├── Organization restructure
       └── After any actual disaster

8.2 Runbook Creation

A runbook documents the step-by-step procedure for recovering one system. It must be detailed enough that someone who has never performed the recovery can follow it, because in a real incident the person responsible may not be available:

Runbook Example: Database Server Recovery

System: SQL-PROD-01 (SQL Server 2022)
Priority: Tier 1 (Mission Critical)
RTO: 1 hour | RPO: 15 minutes
Owner: DBA Team ([email protected], ext. 1234)

Pre-requisites:
├── DR server: SQL-DR-01 (192.168.100.10)
├── Latest backup location: \\nas-dr\sqlbackups\
├── Credentials: stored in password manager (vault.corp.local)
├── Network: DR VLAN 100 must be active
└── DNS: sql-prod.corp.local → update to DR IP

Recovery Steps:

Step 1: Verify DR server is accessible
├── RDP to SQL-DR-01 (192.168.100.10)
├── Login: CORP\sql-admin (password in vault)
├── Verify SQL Server service is running
└── Expected: SQL Server Management Studio opens successfully

Step 2: Check replication status
├── Open SSMS → Always On → Dashboard
├── Check synchronization state
├── If synchronized → proceed to Step 3
├── If NOT synchronized → proceed to Step 2b
│
├── Step 2b: Manual restore from backup
│   ├── Navigate to \\nas-dr\sqlbackups\
│   ├── Find latest full backup + differential + transaction logs
│   ├── RESTORE DATABASE [ProductionDB]
│   │   FROM DISK = 'latest_full.bak'
│   │   WITH NORECOVERY
│   ├── RESTORE DATABASE [ProductionDB]
│   │   FROM DISK = 'latest_diff.bak'
│   │   WITH NORECOVERY
│   ├── RESTORE LOG [ProductionDB]
│   │   FROM DISK = 'latest_log.trn'
│   │   WITH RECOVERY
│   │   (if several log backups follow the differential, restore each in
│   │    order WITH NORECOVERY and use WITH RECOVERY only on the last one)
│   └── Verify: SELECT COUNT(*) FROM critical_table

Step 3: Failover Always On Availability Group
├── SSMS → Right-click AG → Failover
├── Select SQL-DR-01 as new primary
├── Confirm data loss acknowledgment (if async)
└── Verify: AG dashboard shows SQL-DR-01 as primary

Step 4: Update DNS
├── DNS Manager → sql-prod.corp.local
├── Update A record → 192.168.100.10
├── Set TTL to 60 seconds (temporary)
├── Flush DNS on key servers: ipconfig /flushdns
└── Verify: nslookup sql-prod.corp.local → 192.168.100.10

Step 5: Verify application connectivity
├── Test connection from app server
├── Check connection string in web.config
├── Verify: application loads data correctly
├── Test: create test record → verify → delete
└── Monitor error logs for 15 minutes

Step 6: Notify stakeholders
├── Update DR status channel: "SQL Server recovered"
├── Notify application owners
└── Log recovery time in DR log

Verification Checklist:
├── [ ] Database accessible from application
├── [ ] Data integrity verified (row counts, checksums)
├── [ ] No error in SQL Server error log
├── [ ] Application functioning normally
├── [ ] Users can login and work
└── [ ] Backup job reconfigured for DR server

Rollback (when primary site restored):
├── Step 1: Rebuild AG with original primary
├── Step 2: Seed database to original primary
├── Step 3: Failover back to original primary
├── Step 4: Update DNS back to original IP
└── Step 5: Verify and close DR event
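
The manual restore in Step 2b depends on choosing the right files in the right order: the latest full backup, the latest differential taken after it, then every log backup after that. The sketch below shows one way to derive that chain and print the matching RESTORE statements. The `Backup` class, file names, and timestamps are hypothetical, and selecting by finish time is a simplification; a production script would normally select by LSN from the backup history.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Backup:
    kind: str          # "full", "diff", or "log"
    finished: datetime
    path: str


def restore_chain(backups):
    """Latest full, latest diff after it (if any), then later logs in order."""
    fulls = [b for b in backups if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup available")
    full = max(fulls, key=lambda b: b.finished)
    diffs = [b for b in backups
             if b.kind == "diff" and b.finished > full.finished]
    diff = max(diffs, key=lambda b: b.finished, default=None)
    base = diff if diff is not None else full
    logs = sorted((b for b in backups
                   if b.kind == "log" and b.finished > base.finished),
                  key=lambda b: b.finished)
    return [full] + ([diff] if diff is not None else []) + logs


def restore_statements(chain, database="ProductionDB"):
    """Only the final statement uses WITH RECOVERY."""
    statements = []
    for i, b in enumerate(chain):
        verb = "LOG" if b.kind == "log" else "DATABASE"
        mode = "RECOVERY" if i == len(chain) - 1 else "NORECOVERY"
        statements.append(
            f"RESTORE {verb} [{database}] FROM DISK = N'{b.path}' WITH {mode};")
    return statements


# Hypothetical backup set for illustration.
backups = [
    Backup("full", datetime(2026, 4, 5, 0, 0), "full_0405.bak"),
    Backup("diff", datetime(2026, 4, 5, 12, 0), "diff_0405_1200.bak"),
    Backup("log", datetime(2026, 4, 5, 23, 45), "log_0405_2345.trn"),
    Backup("diff", datetime(2026, 4, 6, 0, 0), "diff_0406_0000.bak"),
    Backup("log", datetime(2026, 4, 6, 0, 15), "log_0406_0015.trn"),
    Backup("log", datetime(2026, 4, 6, 0, 30), "log_0406_0030.trn"),
]

chain = restore_chain(backups)
for statement in restore_statements(chain):
    print(statement)
```

Generating the statements instead of typing them reduces the chance of ending the sequence with NORECOVERY, which would leave the database stuck in the restoring state.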

Part 9: DR Metrics, KPIs, and Compliance

9.1 DR Metrics & KPIs

Key DR Metrics to Track:

1. RTO Achievement Rate:
   ├── Measure: systems recovered within RTO / all systems
   ├── Target: 100%
   └── If missed → improve the DR strategy or revise the RTO

2. RPO Achievement Rate:
   ├── Measure: actual data loss vs RPO target
   ├── Target: actual data loss ≤ RPO target
   └── If missed → increase backup/replication frequency

3. DR Test Success Rate:
   ├── Measure: successful tests / all tests
   ├── Target: > 90%
   └── Track issues found per test → should decrease over time

4. Mean Time to Recover (MTTR):
   ├── Measure: average time to recover a system
   ├── Break down by: system type, disaster type
   └── Trend: should fall as the process improves

5. Backup Success Rate:
   ├── Measure: successful backup jobs / all backup jobs
   ├── Target: > 99%
   └── Alert: on every backup failure

6. DR Plan Currency:
   ├── Measure: date the DR plan was last updated
   ├── Target: updated within the last 3 months
   └── Alert: older than 3 months → flag for review

7. DR Budget Utilization:
   ├── Measure: actual DR spend vs budget
   ├── Includes: DR site cost, replication licenses, test costs
   └── Benchmark: DR cost should be about 2-10% of the IT budget
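
Several of these KPIs are simple ratios or averages, so they are easy to compute from DR test results. The following is a minimal sketch for the RTO Achievement Rate and MTTR; the system names and minute values are hypothetical sample data, and the function name is an illustrative choice.

```python
from statistics import mean


def rto_achievement_rate(recovery_minutes, rto_minutes):
    """Share of systems recovered within their RTO.

    Both arguments are dicts keyed by system name, values in minutes.
    """
    met = sum(1 for name, actual in recovery_minutes.items()
              if actual <= rto_minutes[name])
    return met / len(recovery_minutes)


# Hypothetical results from one DR test.
recovery_minutes = {"erp": 200, "email": 500, "web": 10}
rto_minutes = {"erp": 240, "email": 480, "web": 15}

rate = rto_achievement_rate(recovery_minutes, rto_minutes)
mttr = mean(recovery_minutes.values())

print(f"RTO achievement rate: {rate:.0%}")
print(f"MTTR: {mttr:.1f} minutes")
```

In this sample the email system misses its RTO, so the rate falls below the 100% target even though the average recovery time looks acceptable; tracking both numbers avoids that blind spot.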

9.2 Compliance Requirements

DR-Related Compliance Requirements:

1. PDPA (Thailand):
   ├── Requires security measures that protect personal data
   ├── Requires a response plan for data leaks
   ├── The PDPC (สคส.) must be notified within 72 hours of a data breach
   └── Encrypted backups are a common way to meet the security requirement

2. ISO 27001 (Information Security):
   ├── A.17: Information security aspects of BCM (Annex A numbering of the
   │   2013 edition; the 2022 edition covers this in controls 5.29 and 5.30)
   │   ├── A.17.1.1: Planning information security continuity
   │   ├── A.17.1.2: Implementing continuity
   │   └── A.17.1.3: Verify, review, evaluate continuity
   └── The DR plan must be tested regularly

3. PCI DSS (Payment Card):
   ├── Requirement 9.5 (v3.2.1 numbering): Protect media with cardholder data
   ├── Requirement 12.10: Incident response plan
   ├── The DR plan must be tested at least once a year
   └── Backups must be encrypted and stored offsite

4. BOT Guidelines (Bank of Thailand):
   ├── Financial institutions must have a DR plan
   ├── The DR site must be > 50 km from the primary site
   ├── DR must be tested at least twice a year
   ├── RTO for core banking ≤ 4 hours
   └── A board-level BCP Committee is required

5. SEC (Securities and Exchange Commission of Thailand), for listed companies:
   ├── Must have reliable IT systems
   ├── Must have a BCM/DR plan
   └── Must disclose IT risk in the annual report

Part 10: DR Budget Planning

10.1 DR Cost Components

DR Budget Breakdown:

1. DR Site Costs:
   ├── Co-location: rack rental, power, cooling
   │   ├── Bangkok: 15,000-50,000 THB/rack/month
   │   └── Outside Bangkok: 10,000-30,000 THB/rack/month
   ├── Cloud DR: compute, storage, bandwidth
   │   ├── AWS: EC2 reserved + S3 storage
   │   ├── Azure: ASR license + storage
   │   └── Roughly 20-40% of production cloud cost
   └── In-house DR: building, UPS, cooling, internet

2. Hardware/Software:
   ├── DR servers (if on-premises)
   ├── Storage (replication target)
   ├── Network equipment
   ├── Replication software (Veeam, Zerto, etc.)
   ├── DR orchestration tools
   └── Monitoring tools

3. Operational Costs:
   ├── WAN link between sites
   ├── Staff time for DR management
   ├── DR testing costs (compute for tests)
   ├── Training costs
   └── Consultant/vendor support

4. Hidden Costs (often overlooked):
   ├── Licenses needed in duplicate (production + DR)
   ├── Certificate/domain renewal for the DR site
   ├── Data transfer costs (egress fees in cloud)
   ├── DR plan maintenance time
   └── Staff overtime during DR tests

DR Budget Rule of Thumb (by DR strategy):
├── Active-Active: 80-100% of production cost
├── Hot Standby: 50-80% of production cost
├── Warm Standby: 30-50% of production cost
├── Pilot Light: 15-30% of production cost
└── Backup/Restore: 5-15% of production cost
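
The rule of thumb turns into a budget range once each system is mapped to a strategy and a production cost. Below is a minimal sketch of that arithmetic using the percentage ranges above; the portfolio, its monthly costs in THB, and the function names are hypothetical.

```python
# Percentage of production cost per strategy, from the rule of thumb above.
COST_SHARE_PCT = {
    "Active-Active": (80, 100),
    "Hot Standby": (50, 80),
    "Warm Standby": (30, 50),
    "Pilot Light": (15, 30),
    "Backup/Restore": (5, 15),
}


def dr_budget_range(production_cost, strategy):
    """Return (low, high) estimated DR cost for one system."""
    low_pct, high_pct = COST_SHARE_PCT[strategy]
    return production_cost * low_pct / 100, production_cost * high_pct / 100


def portfolio_budget_range(systems):
    """Sum the ranges over (name, production_cost, strategy) tuples."""
    ranges = [dr_budget_range(cost, strategy) for _, cost, strategy in systems]
    return sum(r[0] for r in ranges), sum(r[1] for r in ranges)


# Hypothetical portfolio, monthly production cost in THB.
systems = [
    ("ERP", 400_000, "Warm Standby"),
    ("Email", 100_000, "Pilot Light"),
]
low, high = portfolio_budget_range(systems)
print(f"Estimated DR budget: {low:,.0f} to {high:,.0f} THB/month")
```

The output is only as good as the percentages, which are rough planning figures; the hidden costs listed earlier (duplicate licenses, egress fees) still need to be added on top.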

Part 11: Lessons Learned from Real Incidents

11.1 Case Studies

Real-World DR Lessons:

Case 1: The Great Flood of 2011 (Thailand)
├── Impact: several data centers in industrial estates were flooded
├── Lessons:
│   ├── A backup kept at the same site is no backup
│   ├── The DR site must be in a different flood zone
│   ├── Physical security includes environmental risks
│   └── Cloud DR removes the geographic single point of failure
└── Outcome: many organizations began adopting cloud DR after this event

Case 2: Ransomware Attack on a Hospital
├── Impact: the entire HIS was down and patients had to be transferred
├── Lessons:
│   ├── A backup connected to the network can be encrypted by ransomware
│   ├── An air-gapped (offline) backup is essential
│   ├── 3-2-1 backup rule: 3 copies, 2 media types, 1 offsite
│   ├── Immutable backups (cannot be deleted or modified)
│   └── An incident response plan is needed alongside the DR plan
└── Outcome: recovery took several weeks

Case 3: Cloud Provider Outage
├── Impact: entire region down > 12 hours
├── Lessons:
│   ├── Single cloud region ≠ high availability
│   ├── Multi-region or multi-cloud for critical systems
│   ├── Keep a manual workaround plan for when the cloud is down
│   └── Do not assume the cloud never goes down
└── Outcome: a trend toward multi-cloud DR strategies

Case 4: Human Error (Production Database Deleted)
├── Impact: DROP DATABASE production_db (by accident)
├── Lessons:
│   ├── Principle of least privilege: DBAs should not hold DROP privileges in prod
│   ├── Point-in-time recovery (PITR) is a lifesaver
│   ├── A delayed replica (e.g. 1 hour delay) helps
│   ├── A change management process reduces human error
│   └── Test the restore procedure before an incident happens
└── Outcome: implemented PITR + delayed replica + RBAC
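
A delayed replica only helps if someone stops it before it replays the bad statement. The sketch below shows that arithmetic; the function name and the timestamps are hypothetical, and it assumes the replica applies each change exactly `replica_delay` after the primary.

```python
from datetime import datetime, timedelta


def time_left_to_stop_replica(mistake_at: datetime, now: datetime,
                              replica_delay: timedelta) -> timedelta:
    """Time remaining before the delayed replica replays the bad statement.

    A zero result means the replica has already applied it, so recovery
    has to fall back to PITR from backups.
    """
    remaining = (mistake_at + replica_delay) - now
    return max(remaining, timedelta(0))


if __name__ == "__main__":
    dropped_at = datetime(2026, 4, 8, 10, 0)    # hypothetical time of the DROP
    noticed_at = datetime(2026, 4, 8, 10, 40)   # hypothetical detection time
    left = time_left_to_stop_replica(dropped_at, noticed_at, timedelta(hours=1))
    print(f"Time left to pause replication: {left}")
```

This is why the delay should be chosen against realistic detection times: a 1 hour delay is of little use if mistakes are typically noticed after two hours.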

Part 12: DR Automation

12.1 Infrastructure as Code (IaC) for DR

DR Automation with IaC:

1. Terraform for DR Infrastructure:

# main.tf — DR site infrastructure
provider "aws" {
  alias  = "dr"
  region = "ap-northeast-1"  # Tokyo (DR region)
}

# VPC for DR
resource "aws_vpc" "dr_vpc" {
  provider   = aws.dr
  cidr_block = "10.1.0.0/16"
  tags = { Name = "DR-VPC" }
}

# DB subnet group
resource "aws_db_subnet_group" "dr_db" {
  provider   = aws.dr
  name       = "dr-db-subnet"
  subnet_ids = [aws_subnet.dr_private_1.id, aws_subnet.dr_private_2.id]
}

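# RDS Read Replica (cross-region DR)
resource "aws_db_instance" "dr_replica" {
  provider             = aws.dr
  replicate_source_db  = aws_db_instance.primary.arn  # cross-region source must be given as an ARN
  db_subnet_group_name = aws_db_subnet_group.dr_db.name  # place the replica in the DR VPC subnets
  instance_class       = "db.r6g.large"
  storage_encrypted    = true
  # For an encrypted source, also set kms_key_id to a KMS key ARN in the DR region
  multi_az             = true
  tags = { Name = "DR-Database-Replica" }
}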

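# DR EC2 instances (kept stopped: pilot light)
resource "aws_instance" "dr_app" {
  provider      = aws.dr
  count         = 2
  ami           = data.aws_ami.dr_app.id
  instance_type = "m6i.large"
  # Alternate between the two DR private subnets
  subnet_id     = element([aws_subnet.dr_private_1.id, aws_subnet.dr_private_2.id], count.index)

  tags = { Name = "DR-App-${count.index + 1}" }
}

# aws_instance always launches running; this resource holds the instances stopped.
# Change state to "running" and re-apply during DR activation.
resource "aws_ec2_instance_state" "dr_app" {
  provider    = aws.dr
  count       = 2
  instance_id = aws_instance.dr_app[count.index].id
  state       = "stopped"
}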

2. Ansible Playbook for DR Activation:

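# dr_activate.yml
---
- name: Activate DR Environment
  hosts: dr_servers
  become: true
  tasks:
    - name: Start application services
      ansible.builtin.systemd:
        name: "{{ item }}"
        state: started
        enabled: true
      loop:
        - nginx
        - app-server
        - redis

    - name: Update database connection string
      ansible.builtin.template:
        src: db_config.j2
        dest: /etc/app/database.yml
      notify: restart app-server

    # Run the pending restart now so the health check sees the new config
    - name: Apply pending handlers
      ansible.builtin.meta: flush_handlers

    - name: Verify application health
      ansible.builtin.uri:
        url: "http://localhost:8080/health"
        return_content: true
      register: health_check
      until: health_check.status == 200
      retries: 10
      delay: 30

    # Run once from the control node, which needs boto3 and AWS credentials
    - name: Update DNS via Route53
      amazon.aws.route53:
        state: present
        zone: "corp.example.com"
        record: "app.corp.example.com"
        type: A
        value: "{{ dr_server_ip }}"
        ttl: 60
        overwrite: true
      delegate_to: localhost
      become: false
      run_once: true

  # Referenced by "notify" above; without it the play fails on a missing handler
  handlers:
    - name: restart app-server
      ansible.builtin.systemd:
        name: app-server
        state: restarted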

3. DR Automation Pipeline:

DR Event Detected
       │
       ▼
Automated Assessment
├── Check: primary site reachable?
├── Check: database replication lag?
├── Check: network connectivity?
└── Report: severity level
       │
       ▼
Human Decision: DECLARE DISASTER
       │
       ▼
Automated DR Activation
├── Step 1: Terraform apply → start DR instances
├── Step 2: Ansible → configure services
├── Step 3: Database → promote replica
├── Step 4: DNS → update records
├── Step 5: Load balancer → redirect traffic
├── Step 6: Monitoring → verify health
└── Step 7: Notification → inform stakeholders
       │
       ▼
DR Active → Monitor → Plan Failback
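
The "Automated Assessment" stage of the pipeline above can be sketched as a small function that turns the three checks into a severity report. This is an illustrative sketch only: the `Assessment` fields, the severity labels, and the 300-second lag threshold are assumptions, and the function deliberately reports rather than triggers failover, since declaring a disaster stays a human decision.

```python
import socket
from dataclasses import dataclass


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


@dataclass
class Assessment:
    primary_reachable: bool         # result of the primary-site check
    replication_lag_seconds: float  # lag reported by the database replica
    network_ok: bool                # result of the connectivity check


def severity(a: Assessment, max_lag_seconds: float = 300.0) -> str:
    """Map the three checks to a severity level for the human decision step."""
    if not a.primary_reachable:
        return "critical"  # candidate for declaring a disaster
    if not a.network_ok or a.replication_lag_seconds > max_lag_seconds:
        return "warning"   # degraded, but the primary is still serving
    return "ok"


if __name__ == "__main__":
    # Hypothetical readings; in practice primary_reachable could come from
    # tcp_reachable("primary.corp.example.com", 443).
    report = Assessment(primary_reachable=False,
                        replication_lag_seconds=12.0,
                        network_ok=True)
    print(f"severity: {severity(report)}")
```

High replication lag matters for the decision because it approximates the data that would be lost if the replica were promoted at that moment.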

12.2 DR Orchestration Tools

DR Orchestration Tools:

1. VMware Site Recovery Manager (SRM):
   ├── Automated DR orchestration for VMware
   ├── Recovery plans: VM start order, run scripts
   ├── Non-disruptive testing
   ├── IP customization (change IPs at failover)
   ├── Integration: vSphere Replication, NetApp, Dell EMC
   └── Pricing: per-VM license

2. Zerto:
   ├── Continuous replication — RPO = seconds
   ├── Journal-based recovery — PITR
   ├── Multi-cloud (VMware ↔ AWS ↔ Azure)
   ├── Automated failover + failback
   ├── Non-disruptive testing (VPG test)
   └── Pricing: per-VM subscription

3. Veeam Disaster Recovery Orchestrator:
   ├── Orchestrate Veeam backup-based DR
   ├── Automated DR testing + reporting
   ├── Recovery verification
   ├── Compliance documentation auto-generated
   └── Integration: Veeam B&R, cloud

4. AWS Elastic Disaster Recovery:
   ├── Continuous replication → AWS
   ├── Point-in-time recovery
   ├── Automated launch (recovery instances)
   ├── Non-disruptive drill
   └── Pay-as-you-go pricing

5. Azure Site Recovery:
   ├── Replicate VMs → Azure
   ├── Recovery plans with scripts
   ├── Automated failover/failback
   ├── Compliance reporting
   └── Azure-native integration

Part 13: Summary and Getting-Started Checklist

13.1 DR Implementation Checklist

DR Implementation Checklist:

Phase 1: Assessment (2-4 weeks)
├── [ ] Conduct a Business Impact Analysis (BIA)
├── [ ] Identify critical systems + priority tier
├── [ ] Define RTO/RPO for each system
├── [ ] Conduct a Risk Assessment
├── [ ] Map system dependencies
├── [ ] Calculate downtime cost (baht/hour)
└── [ ] Obtain management approval + budget

Phase 2: Strategy (2-4 weeks)
├── [ ] Choose a DR strategy for each tier
├── [ ] Choose a DR site (co-location, cloud, hybrid)
├── [ ] Choose a replication technology
├── [ ] Choose a backup solution
├── [ ] Design DR network architecture
├── [ ] Define a communication plan
└── [ ] Define roles & responsibilities

Phase 3: Implementation (4-12 weeks)
├── [ ] Setup DR site (hardware, network, storage)
├── [ ] Configure replication
├── [ ] Configure backup (3-2-1 rule)
├── [ ] Configure monitoring & alerting
├── [ ] Write the DR plan document
├── [ ] Write runbooks for each system
├── [ ] Configure DNS failover
├── [ ] Setup DR automation (IaC, scripts)
└── [ ] Train IT team

Phase 4: Testing (2-4 weeks)
├── [ ] Tabletop exercise
├── [ ] Walkthrough test
├── [ ] Simulation test (restore to test environment)
├── [ ] Parallel test (full DR site test)
├── [ ] Document test results
├── [ ] Fix issues found
└── [ ] Schedule regular test calendar

Phase 5: Maintenance (ongoing)
├── [ ] Review the DR plan every 3 months
├── [ ] Update the contact list monthly
├── [ ] Test DR on schedule
├── [ ] Update runbooks whenever something changes
├── [ ] Review backup success rate weekly
├── [ ] Monitor replication health daily
├── [ ] Update the DR plan after any major change
└── [ ] Annual DR plan audit
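
The recurring Phase 5 items lend themselves to a simple overdue check. The sketch below is one possible way to track them: the intervals mirror the checklist above (approximating 3 months as 90 days and a month as 30 days), while the task keys and the `overdue_tasks` function are hypothetical.

```python
from datetime import date, timedelta

# Intervals taken from the Phase 5 checklist; task keys are illustrative.
REVIEW_INTERVALS = {
    "review_dr_plan": timedelta(days=90),           # every 3 months
    "update_contact_list": timedelta(days=30),      # monthly
    "review_backup_success": timedelta(days=7),     # weekly
    "check_replication_health": timedelta(days=1),  # daily
    "audit_dr_plan": timedelta(days=365),           # annual
}


def overdue_tasks(last_done: dict[str, date], today: date) -> list[str]:
    """Return tasks whose interval has elapsed, or that were never recorded."""
    overdue = []
    for task, interval in REVIEW_INTERVALS.items():
        done = last_done.get(task)
        if done is None or today - done > interval:
            overdue.append(task)
    return overdue


if __name__ == "__main__":
    # Hypothetical completion dates.
    history = {
        "review_dr_plan": date(2026, 1, 1),
        "update_contact_list": date(2026, 4, 1),
        "review_backup_success": date(2026, 4, 2),
        "check_replication_health": date(2026, 4, 8),
    }
    print(overdue_tasks(history, today=date(2026, 4, 8)))
```

Tasks with no recorded completion date are reported as overdue on purpose, so a forgotten item surfaces instead of silently passing.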

13.2 The 3-2-1-1-0 Backup Rule

3-2-1-1-0 Backup Rule (Modern Version):

3 = number of copies of the data (1 production + 2 backup copies)
2 = stored on at least 2 media types (disk + tape/cloud)
1 = at least 1 copy offsite (different location)
1 = at least 1 copy immutable/air-gapped
    ├── Immutable: cannot be deleted or modified
    ├── Air-gapped: not connected to the network
    └── Protects against ransomware that encrypts backups
0 = 0 errors: verify backup integrity every time
    ├── Automated restore testing
    ├── Checksum verification
    └── Application-aware backup verification
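
The rule can be checked mechanically against a backup inventory. The following is a minimal sketch under stated assumptions: the `BackupCopy` fields and the `check_3_2_1_1_0` function are hypothetical, the production copy is counted as one of the three copies as in the definition above, and `verify_errors` is assumed to come from the most recent restore or checksum test.

```python
from dataclasses import dataclass


@dataclass
class BackupCopy:
    name: str
    media: str                    # e.g. "disk", "tape", "cloud"
    offsite: bool                 # stored at a different location
    immutable_or_airgapped: bool  # cannot be altered, or kept offline
    verify_errors: int            # errors found by the last verification run


def check_3_2_1_1_0(copies: list[BackupCopy]) -> dict[str, bool]:
    """Evaluate each part of the 3-2-1-1-0 rule for the given copies."""
    return {
        "3_copies": len(copies) >= 3,
        "2_media_types": len({c.media for c in copies}) >= 2,
        "1_offsite": any(c.offsite for c in copies),
        "1_immutable_or_airgapped": any(c.immutable_or_airgapped for c in copies),
        "0_errors": all(c.verify_errors == 0 for c in copies),
    }


if __name__ == "__main__":
    # Hypothetical inventory: production plus two backup copies.
    inventory = [
        BackupCopy("production", "disk", False, False, 0),
        BackupCopy("local-backup", "disk", False, False, 0),
        BackupCopy("cloud-backup", "cloud", True, True, 0),
    ]
    for rule, passed in check_3_2_1_1_0(inventory).items():
        print(f"{rule}: {'PASS' if passed else 'FAIL'}")
```

Returning one result per rule, rather than a single pass/fail, shows which part of the rule a given backup design is missing.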

An IT Disaster Recovery Plan is not a document you write once and leave on the shelf. It is a living document that must be updated, tested, and improved continuously. The most important thing is to start today, without waiting for perfection: an imperfect DR plan that exists is better than no plan at all. Every organization that has been through a serious incident says the same thing: "If only we had known, we would have made a DR plan from the start." Don't be the next one who has to say it.
