---
name: dr-test-planner
description: Designs a disaster recovery test with scope, scenarios, success criteria, observer brief, and a post-test review framework.
version: 1.0.0
author: VantagePoint Networks
audience: IT Managers, BCP/DR Coordinators, Infrastructure Leads, Risk & Compliance Managers
output_format: Formatted Markdown DR test plan, ready for BCP committee approval and for handing to the test coordinator on the day.
license: MIT
---

# DR Test Planner

A structured plan for a DR test that is safe to run, measurable against RTO/RPO, and produces an improvement backlog rather than just a tick-box certificate.

## How to use this skill

1. Download this `SKILL.md` file.
2. Place it in `~/.claude/commands/` (macOS/Linux) or `%USERPROFILE%\.claude\commands\` (Windows).
3. Run `/dr-test-planner` in Claude Code. Describe the scope and target scenario. Answer the safety and success-criteria questions. Receive the test plan.

## When to use this

- Annual DR test is due and the business expects a real exercise, not a desk review.
- You've just invested in a DR solution and need to validate that it works before trusting it.
- A regulator or customer contract requires documented evidence of DR testing.
- You've changed the infrastructure (cloud migration, new data centre, new backup tooling) and need a fresh baseline.
- You're rebuilding the BCP programme and want every test to follow the same shape.

## What you'll get

- **Test type and scope** - tabletop / walkthrough / simulation / parallel / full failover - with explicit rationale.
- **Scenario brief** - single realistic scenario, injected events, what participants do and don't know in advance.
- **Pre-test checklist** - prerequisites, approvals, safety nets, rollback plan.
- **Success criteria** - RTO / RPO targets, business-facing tests, technical checks.
- **Observer brief** - one per major function, with what to watch and record.
- **Roles and RACI** - test director, observers, participants, exec sponsor, customer liaison.
- **Timeline for the day** - minute-by-minute for the first hour, hour-by-hour after.
- **Safety rules** - stop conditions, escalation paths, production-protection measures.
- **Post-test review framework** - scoring against success criteria, findings categorised, improvement backlog.
- **Executive summary template** - pre-written skeleton for the board / regulator report post-test.

## Clarifying questions I will ask you

1. **What DR scenario are you testing?** (DC loss, cyber/ransomware, WAN loss, cloud-provider outage, building unavailable, key-person loss, regional event)
2. **What's the scope?** (One service, a system group, a site, the whole business)
3. **What RTO and RPO are you testing against?** (From your BIA - hours or days)
4. **What test type is appropriate?** (Tabletop = discussion, Walkthrough = step-through with no system action, Simulation = controlled execution, Parallel = DR system runs alongside prod, Full failover = prod goes down, DR goes live)
5. **How much can safely happen in production?** (Full failover tests are rare and risky.)
6. **When is the test window?** (Business-hours = louder signal; out-of-hours = safer but tests a different scenario)
7. **Who's running it?** (Internal, or with vendor/partner involvement)
8. **What are the stop conditions?** (At what point do we abort and rollback?)
9. **Who is the exec sponsor, and who are the observers?**
10. **Are you notifying customers?** (A full test may need customer comms even if it succeeds)
11. **Regulatory / contractual obligations?** (Some require specific evidence formats)
12. **When was the last DR test, and what was its outcome?** (Builds on prior findings)

## Output template

```markdown
# DR Test Plan: <scenario name> - YYYY-MM-DD

**Plan ID:** DRT-YYYY-NNN
**Test Director:** <name>
**Exec Sponsor:** <name>
**BCP Committee approval:** Pending / Approved on YYYY-MM-DD
**Scheduled:** YYYY-MM-DD HH:MM-HH:MM <TZ>
**Test type:** Tabletop / Walkthrough / Simulation / Parallel / Full failover

## 1. Executive Summary
> <One paragraph: what we're testing, why this scenario, what success looks like, what the business should notice (or not), how long the test lasts, and what happens if it fails.>

## 2. Scope

### In scope
- <Service / system / site>
- <Service / system / site>

### Explicitly out of scope
- <Service that will NOT be touched>
- <System that remains on production path>

### Business units affected
- <BU 1> - <how they participate>
- <BU 2> - <how they participate>

## 3. Objectives
1. <Objective - measurable>
2. <Objective - measurable>
3. <Objective - measurable>

**Primary question this test answers:** <one sentence>

## 4. Scenario
<Narrative, 2-3 paragraphs. Realistic. Names a plausible trigger. Describes the state participants wake up to. What do they know? What don't they know?>

### Inject timeline
| Time offset | Inject | Delivered by | To whom |
|---|---|---|---|
| T+0 | "Primary DC has lost power" | Test Director | All participants |
| T+10 | "Status page provider also down" | Test Director | Comms Lead |
| T+25 | "CEO has asked for an update" | Exec Sponsor | IC |

## 5. Success Criteria

### Business-level (pass/fail)
- [ ] Critical service <name> is available from DR site within <RTO> minutes
- [ ] Data loss at failover <= <RPO> minutes
- [ ] Staff can log in and perform core function <test> within <time>
- [ ] Customer-facing <test> succeeds from external network

### Technical checks
- [ ] Failover automation triggers correctly or manual steps complete within runbook time
- [ ] DNS propagates within <time>
- [ ] Authentication works at DR site
- [ ] Monitoring at DR site is visible
- [ ] Backups at DR site are current per RPO

### Process / people checks
- [ ] Test Director can reach all required participants within 10 minutes
- [ ] Runbooks used are accurate - deviations are recorded
- [ ] Decisions made have clear owners and are documented
- [ ] Comms cadence is maintained per plan

### Regulatory / contractual evidence
- [ ] <evidence item 1>
- [ ] <evidence item 2>

## 6. Pre-Test Checklist (T-5 business days)
- [ ] BCP committee approval documented
- [ ] Exec sponsor briefed and available on the day
- [ ] All participants briefed on their role (not the scenario detail)
- [ ] Customer comms drafted and reviewed (if applicable)
- [ ] Backup of any system that will be changed
- [ ] Rollback path tested and confirmed
- [ ] Safety contacts staffed (production on-call NOT participating in test)
- [ ] Observer briefings distributed (see section 8)
- [ ] Recording / transcription method agreed

## 7. Roles
| Role | Name | Responsibility |
|---|---|---|
| Test Director | <name> | Runs the test, delivers injects, calls stop |
| Exec Sponsor | <name> | Approves go, can call abort |
| Incident Commander (participant) | <name> | Leads response - treats test as real |
| Comms Lead (participant) | <name> | Runs comms per plan |
| Scribe / Recorder | <name> | Captures timeline and decisions |
| Observers | <names> | One per function - see section 8 |
| Safety Controller | <name> | Monitors production, has abort authority |
| Vendor Liaison (if applicable) | <name> | On call if third-party involved |

## 8. Observer Brief
Each observer watches one function. They do not participate. They record behaviour.

### Observer - Comms function
**What to record:**
- Time of first customer-facing message after incident declared
- Accuracy and consistency of messages across channels
- Adherence to cadence
- Any external customer complaints surfacing during test

### Observer - Technical response
**What to record:**
- Time from declaration to first action
- Whether runbooks were used as written or improvised
- Decision points and evidence used
- Handovers between shifts

### Observer - Leadership / governance
**What to record:**
- Time from declaration to exec notification
- Quality and cadence of exec briefings
- Decision authority clarity

## 9. Timeline

### T-24h
- Confirm all participants available
- Final safety check on production
- Exec sponsor go/no-go

### T-1h
- Pre-brief observers
- Confirm Safety Controller in position
- Staff production on-call (not participating)

### T+0 (test start)
- Test Director delivers first inject
- Participants respond

### T+0 to T+RTO
- Participants run their real runbooks
- Observers record
- Injects delivered per section 4

### T+RTO
- Primary success check
- If fail: decide continue vs abort

### T+2×RTO (or sooner if success achieved)
- Test stop
- Hot wash (30-min immediate review)
- Rollback begins if applicable

### T+4 hours max
- Systems restored to pre-test state
- Participants released
- Evidence secured

## 10. Safety Rules (must read aloud at test start)
- **Stop conditions** (any observer or participant can call stop):
  - Production incident detected during test
  - Customer impact exceeds pre-agreed tolerance
  - Any participant reports harm or unmanageable stress
  - Scenario becomes unsafe or unrealistic
- **Stop phrase:** "STOP TEST", said three times
- **After stop:** Test Director confirms cessation; all participants pause; rollback begins
- **Safety Controller** has abort authority regardless of Test Director
- **No real customer comms** are sent unless explicitly part of scope

## 11. Rollback / Restoration Plan
<How the environment returns to pre-test state, step by step, with owner and time estimate per step>

## 12. Post-Test Review Framework

### Immediate (hot wash, 30 min after test)
- What did you observe?
- What felt wrong?
- What did you learn you didn't know before?

### Structured review (within 5 business days)
For each success criterion in section 5:
- Pass / Fail / Partial
- Evidence
- Root cause of any fail
- Recommendation

Findings categorised:
- **Critical:** DR capability compromised; immediate remediation
- **High:** Material gap; remediate within quarter
- **Medium:** Improvement opportunity; track in backlog
- **Low:** Note for reference

### Improvement backlog
| # | Finding | Severity | Owner | Due | Linked risk register entry |
|---|---|---|---|---|---|

## 13. Executive Summary (post-test)
> <Skeleton - fill in after the test>
>
> On <date>, <org> ran a <type> DR test simulating <scenario>. The test lasted <duration> with <N> participants and <N> observers. The business-level success criteria were <N met / N not met>. Primary findings: <3 bullets>. Remediation backlog: <N> items tracked, with the highest priority being <top item>. The next scheduled DR test is <date>, covering <scope>.
```
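The success criteria in section 5 hinge on measured times, not impressions. As a minimal sketch of how the Scribe's recorded timestamps could be scored against the RTO/RPO targets (all names and values below are illustrative, not from a real test):

```python
from datetime import datetime, timedelta

# Targets from the BIA (illustrative)
RTO = timedelta(hours=2)
RPO = timedelta(minutes=15)

# Timestamps the Scribe records on the day (illustrative)
declared = datetime(2025, 6, 1, 9, 0)      # T+0: incident declared
service_up = datetime(2025, 6, 1, 10, 40)  # DR service passes the business-level check
last_txn = datetime(2025, 6, 1, 8, 52)     # newest transaction present at the DR site

actual_rto = service_up - declared  # time to recover
actual_rpo = declared - last_txn    # data-loss window

print(f"RTO: {actual_rto} vs target {RTO}: {'PASS' if actual_rto <= RTO else 'FAIL'}")
print(f"RPO: {actual_rpo} vs target {RPO}: {'PASS' if actual_rpo <= RPO else 'FAIL'}")
```

The point is that both figures come from timestamps captured during the test; if the Scribe didn't record them, the criterion cannot be scored.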

## Example invocation

**User:** "/dr-test-planner - we want to test failover of our London-to-Dublin DR arrangement for the customer portal. RTO 2 hours, RPO 15 minutes. Haven't tested it since install 18 months ago. I want a parallel test, not a full cutover."

**What the skill will do:**
1. Ask about scope (portal plus its dependencies, or portal only?), business-hours vs weekend timing, whether customers will be notified, who the exec sponsor is, and what previous tests (or their absence) say about baseline confidence.
2. Propose a parallel test scenario in which the DR site is brought up alongside production, transactional replay is verified, and DNS is NOT cut over but is confirmed ready to cut.
3. Produce the plan with injects that probe weak spots (CEO demanding an update at T+25, status page provider "also down" to force manual comms), and success criteria tied to the 2h/15m targets.
4. Include the Safety Controller role and explicit stop conditions to protect production.
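The inject timeline (section 4 of the template) works best as a single ordered run sheet the Test Director ticks off on the day. A minimal sketch, using the injects from the example above (offsets and recipients illustrative):

```python
# (offset in minutes, inject text, delivered by, delivered to)
injects = [
    (25, "CEO has asked for an update", "Exec Sponsor", "IC"),
    (0, "Primary DC has lost power", "Test Director", "All participants"),
    (10, "Status page provider also down", "Test Director", "Comms Lead"),
]

# Sort by offset so the run sheet reads top to bottom on the day
for offset, inject, by, to in sorted(injects):
    print(f"T+{offset:>3} min | {by} -> {to}: {inject}")
```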

## Notes for the requester

- **Choose the smallest test type that answers your question.** A tabletop costs an hour; a full failover can cost a weekend plus risk. Pick the type that makes the business smarter for the least cost.
- **Observers are the product, not the participants.** The test generates findings because observers *record behaviour*. No observers = no learning, just a tick-box.
- **Stop conditions are mandatory.** A test that cannot be stopped becomes an incident. Read them aloud at start; everyone agrees.
- **Rollback before declaring success.** Confirm the environment is fully back to pre-test state before releasing participants. Half-failed-over DR is worse than no DR.
- **Last year's findings are this year's gates.** If last year found that runbook X was out of date, this year's pre-test checklist confirms it has been updated. Tests compound.
- **"Good" looks like:** you produce a regulator-ready report, a named improvement backlog, and at least two findings you didn't already know. A test that produces zero findings is suspicious - either it wasn't realistic or observers weren't looking hard enough.
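The "N met / N not met" figures in the post-test executive summary (template section 13) can be tallied straight from the structured review. A minimal sketch with illustrative criteria and outcomes:

```python
# Structured-review outcome per business-level criterion (illustrative)
results = {
    "Portal available from DR within RTO": "pass",
    "Data loss within RPO": "pass",
    "Staff perform core function in time": "partial",
    "External customer check succeeds": "fail",
}

# Only an unqualified pass counts as "met"; partials go to the backlog
met = sum(1 for outcome in results.values() if outcome == "pass")
print(f"Business-level success criteria: {met} met / {len(results) - met} not met")
```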
