Skip to content

SOC Disaster Recovery & Business Continuity Plan

Document ID: OPS-SOP-011 Version: 1.0 Classification: Confidential Last Updated: 2026-02-15

When the SOC itself goes down — SIEM failure, network outage, ransomware hitting SOC infrastructure — this plan tells you how to keep operating and how to recover.


Scope

This document covers: - 🛡️ SOC infrastructure failures (SIEM, EDR, ticketing, network) - 🔄 Failover and manual operating procedures - 📋 Recovery steps and priorities - 🎯 RPO/RTO targets for SOC services

This does NOT replace the organization-wide BCP/DR plan. It supplements it with SOC-specific procedures.


SOC Service Catalog & RTO/RPO

# SOC Service RTO RPO Priority Backup Method
1 Alert Monitoring (SIEM) 1 hour 0 min 🔴 P1 Secondary SIEM / manual log review
2 Incident Response (ticketing + comms) 30 min 0 min 🔴 P1 Email + phone + spreadsheet
3 EDR Console 2 hours 0 min 🔴 P1 Endpoint isolation via GPO/firewall
4 Threat Intelligence (TIP) 4 hours 24 hours 🟠 P2 Manual IOC lookup (VirusTotal, OTX)
5 Log Ingestion Pipeline 2 hours 1 hour 🟠 P2 Local log buffering / syslog failover
6 Detection Rules Engine 4 hours 0 min 🟠 P2 Rules stored in Git (restore from repo)
7 Dashboard / Reporting 8 hours 24 hours 🟡 P3 Manual reports via email
8 SOC Wiki / Knowledge Base 24 hours Weekly 🟡 P3 Offline copies / printed SOPs
9 Automation / SOAR 4 hours 0 min 🟠 P2 Manual playbook execution
10 Communication Channels (Slack/Teams) 30 min 0 min 🔴 P1 Phone tree + personal mobile

RTO = Recovery Time Objective (max downtime allowed) RPO = Recovery Point Objective (max data loss allowed)


Disaster Scenarios & Response Procedures

Scenario 1: SIEM Down / Unavailable

graph TD
    A[SIEM Down] --> B{Duration?}
    B -->|< 1 hour| C[Wait for auto-recovery]
    B -->|1-4 hours| D[Activate Secondary SIEM]
    B -->|> 4 hours| E[Manual Log Review Mode]
    C --> F[Verify backlog processing]
    D --> F
    E --> G[Deploy emergency queries]
    G --> F
    F --> H[Return to normal ops]
Step Action Owner Notes
1 Confirm SIEM is down (not a network issue) SOC Tier 1 Check SIEM health endpoint
2 Notify SOC Lead + SOC Engineering On-duty analyst Slack + phone
3 Activate secondary SIEM (if available) SOC Engineering Switch DNS/proxy for log forwarding
4 If no secondary: switch to Manual Log Review Mode SOC Tier 2 Use direct log access (see below)
5 Begin reviewing critical logs directly All analysts Focus: Firewall, AD, EDR console
6 Track alerts manually in emergency spreadsheet SOC Tier 1 Template: 11_Reporting_Templates/emergency_alert_tracker.xlsx
7 Inform stakeholders of reduced visibility SOC Lead Use Communication Templates
8 Once restored: verify no events were lost SOC Engineering Check log timestamps for gaps
9 Process any backlogged alerts All analysts Prioritize P1/P2 timeframes

Manual Log Review Mode — Where to look:

Log Source Direct Access Method Priority Checks
Firewall Firewall GUI / CLI Blocked outbound traffic, new allow rules
Active Directory Event Viewer on DC 4720 (new account), 4728 (group change), 4625 (failed logon)
EDR Console EDR web portal directly Active detections, quarantine events
Email Gateway Gateway admin portal Phishing detections, blocked attachments
Cloud (AWS/Azure) Cloud console directly IAM changes, security findings

Scenario 2: EDR Platform Down

Step Action Owner
1 Confirm EDR is down (check console + agent health) SOC Tier 1
2 Notify SOC Lead + vendor SOC Lead
3 Shift to network-based detection (IDS/IPS, firewall) SOC Tier 2
4 Deploy emergency GPO to block known-bad hashes SOC Engineering
5 Increase firewall monitoring (outbound anomalies) SOC Tier 1
6 If > 4 hours: consider temporary host-based logging (Sysmon) SOC Engineering
7 Once restored: run full environment scan EDR Admin

Scenario 3: Network Outage (SOC Can't Reach Monitored Environment)

Step Action Owner
1 Confirm scope of outage (partial vs full) SOC Tier 1
2 Contact NOC / network team for ETA SOC Lead
3 If partial: focus monitoring on reachable segments SOC Tier 2
4 If full: activate on-site SOC personnel (if available) SOC Manager
5 Monitor cloud environments independently SOC Tier 2
6 Once restored: scan for events during blackout period All analysts

Scenario 4: SOC Infrastructure Ransomware Attack

⚠️ This is the worst-case scenario. SOC tools themselves are compromised.

Step Action Owner Within
1 Isolate SOC network segment immediately SOC Engineering 5 min
2 Switch to out-of-band communication (personal mobile, pre-shared numbers) SOC Lead 10 min
3 Activate external IR vendor SOC Manager/CISO 30 min
4 Assess blast radius (which SOC tools are compromised?) IR Team 1 hour
5 Stand up emergency SOC on clean infrastructure SOC Engineering 2 hours
6 Restore from known-good backups (see backup schedule below) SOC Engineering 4 hours
7 Rebuild compromised systems from scratch (do NOT trust cleaned images) SOC Engineering 24–72 hours
8 Conduct full post-incident review IR Lead 1 week

Emergency SOC Kit (pre-staged): - Clean laptop with pre-configured VPN - Printed copy of SOPs and escalation matrix - Portable SIEM (e.g., Security Onion on USB) - Pre-shared phone list (not stored digitally only) - Backup cloud accounts (separate from production)


Scenario 5: Ticketing System / SOAR Down

Step Action Owner
1 Confirm ticketing/SOAR is unavailable SOC Tier 1
2 Switch to emergency alert tracker (Google Sheets / Excel) SOC All
3 Execute playbooks manually using this SOP repo SOC Tier 2
4 Track all manual actions for later entry SOC Tier 1
5 Once restored: backfill tickets from emergency tracker SOC Tier 1

Backup Schedule

System Backup Type Frequency Retention Storage Location
SIEM Configuration Full export Daily 30 days S3 / off-site NAS
SIEM Data Incremental Hourly 90 days Separate storage cluster
Detection Rules (Sigma/YARA) Git repo On every change Unlimited GitHub + local mirror
SOAR Playbooks Export Daily 30 days S3 / off-site NAS
Ticketing System DB Full + WAL Hourly 30 days Cross-region DB replica
EDR Configuration Vendor snapshot Weekly 12 weeks Cloud backup
SOC Wiki / Docs Git repo On every change Unlimited GitHub + local mirror
Dashboards / Reports JSON export Weekly 90 days S3 / off-site NAS
TI Feeds / IOC DB Full dump Daily 30 days S3 / off-site NAS

Backup Verification

  • Monthly: Restore test of SIEM config backup
  • Quarterly: Full DR drill (simulate SIEM failure + recovery)
  • Annually: Full SOC infrastructure recovery test

Communication During DR

Priority Channel Use
🔴 Primary Phone tree (see Escalation Matrix) When all digital channels may be compromised
🟠 Secondary Personal mobile + WhatsApp group If Slack/Teams is down but phones work
🟡 Tertiary Pre-shared email on alternate provider (e.g., Gmail) If corporate email is down
🔵 Last resort Physical meetup at pre-designated location Total infrastructure failure

⚠️ Pre-stage these channels NOW. Don't set them up during an incident.


DR Testing Schedule

Test Type Frequency Duration Participants Pass Criteria
Tabletop Exercise (DR scenario discussion) Quarterly 2 hours SOC team + Management Roles clear, procedures understood
SIEM Failover Test Semi-annually 4 hours SOC Engineering RTO met within target, no data loss
Full DR Drill (complete SOC infrastructure) Annually 1 day SOC team + IT + Management All P1 services restored within RTO
Communication Test (phone tree + backup comms) Quarterly 30 min All SOC personnel 100% reachable via backup channel
Backup Restore Test Monthly 2 hours SOC Engineering Config restored, rules validated

SOC Degradation Levels

Level Condition Capability Actions
🟢 Normal All systems operational Full detection + response Normal operations
🟡 Degraded 1–2 secondary tools down Partial detection, full response Notify stakeholders, increase manual checks
🟠 Limited SIEM or EDR down Significantly reduced detection Activate manual procedures, notify management
🔴 Emergency Multiple critical tools down Manual-only operations Full DR activation, external IR support
Black SOC infrastructure compromised Zero internal capability External IR vendor takes lead

Annual Review Checklist

  • Review and update all RTO/RPO targets
  • Verify backup procedures are working
  • Update contact lists and phone trees
  • Test DR communication channels
  • Conduct at least one full DR drill
  • Update emergency SOC kit contents
  • Review vendor contracts for IR support
  • Brief all SOC personnel on DR procedures
  • Update this document with lessons learned