--- name: runbook-generator description: Turns a plain-English description of "how we do X" into a formatted, production-ready runbook with roles, steps, rollback, verification, and escalation. version: 1.0.0 author: VantagePoint Networks audience: IT Managers, Operations Leads, SREs, Service Desk Managers output_format: Formatted Markdown runbook, ready to paste into Confluence, SharePoint, Notion, or a Git-tracked `/runbooks` folder. license: MIT --- # Runbook Generator Describe a procedure the way you'd explain it to a new hire at lunch. Get back a runbook the new hire can execute at 3am, alone, under pressure, with nobody to ask. ## How to use this skill 1. Download this `SKILL.md` file. 2. Place it in `~/.claude/commands/` (macOS/Linux) or `%USERPROFILE%\.claude\commands\` (Windows). 3. Run `/runbook-generator` in Claude Code. Describe the procedure in whatever order it comes out - the skill will structure it. ## When to use this - You have "tribal knowledge" living only in one engineer's head and you want to de-risk that. - You're writing a bid or compliance response and need documented operational procedures. - You're standing up a new service desk and need a library of playbooks fast. - You already have a runbook but it's grown messy over 3 years of ad-hoc edits and you want a clean rewrite. - You want every runbook in your organisation to follow the same shape. ## What you'll get - A fully structured runbook with the standard sections (prerequisites, triggers, roles, steps, verification, rollback, escalation). - **Role-aware step numbering** - each step labelled with who executes it (L1 agent, on-call engineer, vendor, requester). - **Expected duration per step** so the operator knows when to escalate. - **Decision points** called out explicitly - "if X, continue; if Y, jump to step N or escalate". - **Copy-paste commands / snippets** where relevant, quoted in monospace blocks. - **Rollback procedure** as a first-class citizen, not an afterthought. - **Verification / success criteria** stated in terms the business would recognise, not just "command returns 0". - **Review metadata** - author, date, next review due, linked changes/incidents. ## Clarifying questions I will ask you 1. **What's the procedure called?** (Name it as you'd search for it at 3am.) 2. **What triggers this runbook?** (Alert, ticket type, calendar event, user request, manual decision) 3. **Who typically runs it?** (L1, L2, on-call, specific team, external vendor) 4. **How often?** (Rare one-off, weekly, every alert, during incidents only) 5. **What's the expected outcome when it goes right?** 6. **Walk me through the steps, in any order.** (I'll sequence them.) 7. **What could go wrong at each step, and what's the recovery?** 8. **What must be true before you can start?** (Access, licences, approvals, maintenance window) 9. **When do you stop and escalate?** (Specific conditions, not just "if it feels wrong") 10. **How do you know it worked?** (The verification test - business outcome, not just "no errors") ## Output template ```markdown # Runbook: **Runbook ID:** RB- **Version:** 1.0 **Author:** **Last reviewed:** YYYY-MM-DD **Next review due:** YYYY-MM-DD (max 12 months) **Classification:** Internal / Confidential **Linked procedures:** ## 1. Purpose ## 2. When to use this runbook **Triggers:** - - **Do NOT use this runbook if:** - ## 3. Prerequisites Before starting, confirm ALL of the following: - [ ] You have - check with `` - [ ] is available - [ ] is in place (if applicable) - [ ] You have - [ ] is current If any of the above is missing: stop, do not proceed, contact . ## 4. Roles | Role | Responsibility | |---|---| | Primary operator | Executes the procedure | | Scribe / witness | Observes and records (if change-managed) | | Approver | Signs off at the decision point in step N | | Escalation point | Contactable during the procedure | ## 5. Estimated total duration under normal conditions, up to if step N requires the slow path. ## 6. Procedure ### Step 1 - (Role: | Est: ) **Command / action:** ``` ``` **Expected result:** **If you don't see that:** --- ### Step 2 - (Role: | Est: ) <...> **Decision point:** - If **yes**: continue to step 3. - If **no**: jump to step 7 (rollback). --- ### Step 3 - (Role: | Est: ) --- ## 7. Rollback / Backout Procedure Triggered if: ### Rollback Step 1 - ``` ``` ### Rollback Step 2 - <...> **Rollback verification:** - [ ] - [ ] ## 8. Verification / Success Criteria The procedure is considered successful when ALL of the following are true: - [ ] - [ ] - [ ] "> If any check fails: do not close the ticket, escalate per section 9. ## 9. Escalation | Condition | Escalate to | Channel | Response time | |---|---|---|---| | | | | | | | | | | | | | | immediate | ## 10. Known Issues & Gotchas - - - ## 11. References - - - - ## 12. Change History | Version | Date | Author | Change | |---|---|---|---| | 1.0 | YYYY-MM-DD | | Initial | ``` ## Example invocation **User:** "/runbook-generator - I want a runbook for when our VPN concentrator fails over to the secondary appliance. Sometimes we have to manually force sessions back after the primary recovers, which takes about 10 minutes. Usually triggered by a monitoring alert called 'VPN_HA_FAILOVER'." **What the skill will do:** 1. Ask: does L1 handle this or does it wake on-call? Are there any user comms needed mid-failback? What breaks if you forget to do it? 2. Produce a runbook with prerequisites (admin access to both appliances, maintenance window confirmation if during business hours), clearly separated failback steps, a decision point ("is session count on secondary < 50? yes -> continue, no -> defer to next maintenance window"), rollback (force failover back to secondary if primary misbehaves), and a verification pass that includes user-facing check ("remote users can still connect and route traffic"). ## Notes for the requester - **Don't edit while you're describing.** Dump everything - the skill will sequence. Trying to dictate the "right" order often loses the edge cases. - **Name the steps after what they accomplish**, not the command they run. "Confirm primary is healthy" is better than "run show ha-status". - **Include the failure stories.** "Last time we did this, the cert expired mid-way and we had to re-issue" - that becomes a Known Issue in section 10. - **Reference the real monitoring alert by name.** If the alert is `VPN_HA_FAILOVER`, write exactly that - L1 agents copy-paste these into search bars. - **"Good" looks like:** a new L2 can execute this runbook alone, at 3am, without needing to ping the author. If a step makes the reader ask "but what about X?", step needs to be split or a decision point added.