---
name: runbook-generator
description: Turns a plain-English description of "how we do X" into a formatted, production-ready runbook with roles, steps, rollback, verification, and escalation.
version: 1.0.0
author: VantagePoint Networks
audience: IT Managers, Operations Leads, SREs, Service Desk Managers
output_format: Formatted Markdown runbook, ready to paste into Confluence, SharePoint, Notion, or a Git-tracked `/runbooks` folder.
license: MIT
---

# Runbook Generator

Describe a procedure the way you'd explain it to a new hire at lunch. Get back a runbook the new hire can execute at 3am, alone, under pressure, with nobody to ask.

## How to use this skill

1. Download this `SKILL.md` file.
2. Place it in `~/.claude/commands/` (macOS/Linux) or `%USERPROFILE%\.claude\commands\` (Windows), saved as `runbook-generator.md`. Claude Code takes the command name from the filename.
3. Run `/runbook-generator` in Claude Code. Describe the procedure in whatever order it comes out - the skill will structure it.

## When to use this

- You have "tribal knowledge" living only in one engineer's head and you want to de-risk that.
- You're writing a bid or compliance response and need documented operational procedures.
- You're standing up a new service desk and need a library of playbooks fast.
- You already have a runbook but it's grown messy over 3 years of ad-hoc edits and you want a clean rewrite.
- You want every runbook in your organisation to follow the same shape.

## What you'll get

- A fully structured runbook with the standard sections (prerequisites, triggers, roles, steps, verification, rollback, escalation).
- **Role-aware step numbering** - each step labelled with who executes it (L1 agent, on-call engineer, vendor, requester).
- **Expected duration per step** so the operator knows when to escalate.
- **Decision points** called out explicitly - "if X, continue; if Y, jump to step N or escalate".
- **Copy-paste commands / snippets** where relevant, quoted in monospace blocks.
- **Rollback procedure** as a first-class citizen, not an afterthought.
- **Verification / success criteria** stated in terms the business would recognise, not just "command returns 0".
- **Review metadata** - author, date, next review due, linked changes/incidents.
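
As a rough illustration of the role-aware numbering and per-step durations listed above, the sketch below renders step headings in the shape the output template uses. The `Step` class, `step_heading`, and `total_minutes` are hypothetical names invented for this example and are not part of the skill. The sample steps are made up.

```python
from dataclasses import dataclass


@dataclass
class Step:
    title: str    # what the step accomplishes, not the command it runs
    role: str     # who executes it, e.g. "L1 agent" or "on-call engineer"
    minutes: int  # expected duration; overrunning it is a cue to escalate


def step_heading(number: int, step: Step) -> str:
    """Render a step heading in the shape used by the output template."""
    return (
        f"### Step {number} - {step.title} "
        f"(Role: {step.role} | Est: {step.minutes} min)"
    )


def total_minutes(steps: list[Step]) -> int:
    """Sum the per-step estimates to get the normal-path duration."""
    return sum(step.minutes for step in steps)


# Hypothetical sample steps, for illustration only.
steps = [
    Step("Confirm primary is healthy", "on-call engineer", 2),
    Step("Approve failback", "approver", 5),
]
for number, step in enumerate(steps, start=1):
    print(step_heading(number, step))
print(f"Estimated total: {total_minutes(steps)} minutes")
```

Keeping the role and estimate in the heading means an operator scanning the runbook can see who acts next, and how long to wait before escalating, without reading the step body.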

## Clarifying questions I will ask you

1. **What's the procedure called?** (Name it as you'd search for it at 3am.)
2. **What triggers this runbook?** (Alert, ticket type, calendar event, user request, manual decision)
3. **Who typically runs it?** (L1, L2, on-call, specific team, external vendor)
4. **How often?** (Rare one-off, weekly, every alert, during incidents only)
5. **What's the expected outcome when it goes right?**
6. **Walk me through the steps, in any order.** (I'll sequence them.)
7. **What could go wrong at each step, and what's the recovery?**
8. **What must be true before you can start?** (Access, licences, approvals, maintenance window)
9. **When do you stop and escalate?** (Specific conditions, not just "if it feels wrong")
10. **How do you know it worked?** (The verification test - business outcome, not just "no errors")

## Output template

````markdown
# Runbook: <Procedure Name>

**Runbook ID:** RB-<short-slug>
**Version:** 1.0
**Author:** <name>
**Last reviewed:** YYYY-MM-DD
**Next review due:** YYYY-MM-DD (max 12 months)
**Classification:** Internal / Confidential
**Linked procedures:** <related runbook IDs>

## 1. Purpose
<One paragraph: what this runbook accomplishes and why it exists.>

## 2. When to use this runbook
**Triggers:**
- <Specific alert name / ticket type / condition>
- <Specific alert name / ticket type / condition>

**Do NOT use this runbook if:**
- <Out-of-scope condition - redirects reader to the right runbook>

## 3. Prerequisites
Before starting, confirm ALL of the following:
- [ ] You have <access / role> - check with `<command or URL>`
- [ ] <Required tool / VPN / credential> is available
- [ ] <Approval / change ticket> is in place (if applicable)
- [ ] You have <physical access / out-of-band console / bridge call>
- [ ] <Backup / snapshot / failover path> is current

If any of the above is missing: stop, do not proceed, contact <escalation>.

## 4. Roles
| Role | Responsibility |
|---|---|
| Primary operator | Executes the procedure |
| Scribe / witness | Observes and records (if change-managed) |
| Approver | Signs off at the decision point in step N |
| Escalation point | Contactable during the procedure |

## 5. Estimated total duration
<X minutes> under normal conditions, up to <Y minutes> if step N requires the slow path.

## 6. Procedure

### Step 1 - <What> (Role: <operator> | Est: <minutes>)
<What to do, in plain imperative language.>

**Command / action:**
```<language>
<exact command or action>
```

**Expected result:**
<What you should see.>

**If you don't see that:**
<Troubleshooting hint or escalation.>

---

### Step 2 - <What> (Role: <operator> | Est: <minutes>)
<...>

**Decision point:** <yes/no question>
- If **yes**: continue to step 3.
- If **no**: jump to section 7 (rollback).

---

### Step 3 - <What> (Role: <approver> | Est: <minutes>)
<Approval step - explicit because it changes who is acting.>

---

<repeat for each step>

## 7. Rollback / Backout Procedure
Triggered if: <specific condition>

### Rollback Step 1 - <What>
```<language>
<command>
```

### Rollback Step 2 - <What>
<...>

**Rollback verification:**
- [ ] <Check 1>
- [ ] <Check 2>

## 8. Verification / Success Criteria
The procedure is considered successful when ALL of the following are true:
- [ ] <Business-facing check - e.g. "Users can log in from the staff laptop">
- [ ] <Technical check - e.g. "Output of `show ip bgp summary` matches baseline below">
- [ ] <Monitoring check - e.g. "No P1/P2 alerts firing for <service>">

If any check fails: do not close the ticket, escalate per section 9.

## 9. Escalation
| Condition | Escalate to | Channel | Response time |
|---|---|---|---|
| <Failure pattern 1> | <team / person> | <phone / pager / Slack> | <SLA> |
| <Failure pattern 2> | <team / person> | <phone / pager / Slack> | <SLA> |
| <Unknown / unhandled> | <incident commander> | <bridge number> | immediate |

## 10. Known Issues & Gotchas
- <A thing that has bitten people before and how to spot it early>
- <A bug in the tooling that has a known workaround>
- <A timing issue, e.g. "do NOT run during the 02:00 UTC backup window">

## 11. References
- <Linked runbooks>
- <Vendor documentation URL>
- <Previous incidents that produced this runbook>
- <Change ticket that introduced it>

## 12. Change History
| Version | Date | Author | Change |
|---|---|---|---|
| 1.0 | YYYY-MM-DD | <name> | Initial |
````
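
If you keep runbooks in a Git-tracked folder, a small check can confirm that each one still follows the template's shape. The sketch below is one possible approach, not part of the skill: `missing_sections` is a hypothetical helper, and it assumes headings are written exactly as `## <number>. <title>` with the titles from the template above.

```python
import re

# Section titles copied from the output template, in order.
REQUIRED_SECTIONS = [
    "Purpose",
    "When to use this runbook",
    "Prerequisites",
    "Roles",
    "Estimated total duration",
    "Procedure",
    "Rollback / Backout Procedure",
    "Verification / Success Criteria",
    "Escalation",
    "Known Issues & Gotchas",
    "References",
    "Change History",
]

# Matches second-level headings such as "## 3. Prerequisites".
HEADING = re.compile(r"^## (\d+)\. (.+?)[ \t]*$", re.MULTILINE)


def missing_sections(runbook_markdown: str) -> list[str]:
    """Return template sections that are absent or numbered differently."""
    found = {(int(num), title) for num, title in HEADING.findall(runbook_markdown)}
    return [
        f"{index}. {title}"
        for index, title in enumerate(REQUIRED_SECTIONS, start=1)
        if (index, title) not in found
    ]


# Hypothetical draft that only has sections 1 and 3.
draft = "# Runbook: VPN failback\n\n## 1. Purpose\n...\n\n## 3. Prerequisites\n...\n"
for section in missing_sections(draft):
    print(f"missing: {section}")
```

The check is deliberately strict about numbering, so a runbook whose rollback section has drifted to a different number is flagged too. It does not skip headings that appear inside fenced code blocks, so treat it as a first-pass lint only.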

## Example invocation

**User:** "/runbook-generator - I want a runbook for when our VPN concentrator fails over to the secondary appliance. Sometimes we have to manually force sessions back after the primary recovers, which takes about 10 minutes. Usually triggered by a monitoring alert called 'VPN_HA_FAILOVER'."

**What the skill will do:**
1. Ask: does L1 handle this, or does it wake on-call? Are any user comms needed mid-failback? What breaks if you forget to do it?
2. Produce a runbook with prerequisites (admin access to both appliances, maintenance window confirmation if during business hours), clearly separated failback steps, a decision point ("is session count on secondary < 50? yes -> continue, no -> defer to next maintenance window"), rollback (force failover back to secondary if primary misbehaves), and a verification pass that includes a user-facing check ("remote users can still connect and route traffic").
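
The decision point in that example is concrete enough to express as a tiny function, which is a useful test of whether a decision point is specific rather than "if it feels wrong". This is a minimal sketch: `failback_decision` and `Outcome` are hypothetical names, and the threshold of 50 sessions is taken from the example above, not from any real appliance.

```python
from enum import Enum


class Outcome(Enum):
    CONTINUE = "continue with failback"
    DEFER = "defer to next maintenance window"


def failback_decision(secondary_session_count: int, threshold: int = 50) -> Outcome:
    """Decision point: fail back now only if few sessions would be disrupted."""
    if secondary_session_count < threshold:
        return Outcome.CONTINUE
    return Outcome.DEFER


print(failback_decision(12).value)  # below the threshold
print(failback_decision(50).value)  # at the threshold, so not "< 50"
```

If a decision point cannot be written this plainly, with a measurable input and a named outcome for each branch, it probably needs another clarifying question before it goes into the runbook.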

## Notes for the requester

- **Don't edit while you're describing.** Dump everything - the skill will sequence. Trying to dictate the "right" order often loses the edge cases.
- **Name the steps after what they accomplish**, not the command they run. "Confirm primary is healthy" is better than "run show ha-status".
- **Include the failure stories.** "Last time we did this, the cert expired mid-way and we had to re-issue" - that becomes a Known Issue in section 10.
- **Reference the real monitoring alert by name.** If the alert is `VPN_HA_FAILOVER`, write exactly that - L1 agents copy-paste these into search bars.
- **"Good" looks like:** a new L2 can execute this runbook alone, at 3am, without needing to ping the author. If a step makes the reader ask "but what about X?", that step needs to be split or given a decision point.
