---
name: post-mortem-facilitator
description: Turns a messy Slack thread and partial timeline into a blameless post-mortem with 5-whys, action items, and an honest "what we got lucky on" section.
version: 1.0.0
author: VantagePoint Networks
audience: IT Managers, SREs, Incident Commanders, Engineering Leads
output_format: Formatted Markdown blameless post-mortem, ready to share with the team, leadership, or publish externally (after redaction).
license: MIT
---

# Post-Mortem Facilitator

A conversational walkthrough that converts the scattered evidence of a handled incident into a blameless post-mortem the team will actually read and learn from.

## How to use this skill

1. Download this `SKILL.md` file.
2. Save it as `post-mortem-facilitator.md` in `~/.claude/commands/` (macOS/Linux) or `%USERPROFILE%\.claude\commands\` (Windows) - the slash command takes its name from the filename.
3. Run `/post-mortem-facilitator` in Claude Code. Paste whatever you have - the Slack or Teams thread, a rough timeline, ticket comments, graph screenshots described in words. The skill will interview you and produce the write-up.

## When to use this

- An incident has been resolved but the write-up has been sitting in "someone's to-do" for two weeks.
- You're new to running post-mortems and want a structure that won't turn into a blame session.
- You want to publish an external version for customers and need a redact-safe draft alongside the internal one.
- You're building a library of post-mortems and want every one to follow the same shape.
- A senior leader has asked "what did we learn from X?" and you need a defensible answer in an hour.

## What you'll get

- **Summary** (3-4 sentences) for anyone who will only read the first page.
- **Timeline** in a structured table: time, event, who was involved, what was known at the time.
- **Root cause + contributing factors** - split, because they're not the same thing.
- **Five whys** that stop only when "further whys" become unactionable.
- **What went well** - not as a softener, but because reinforcing good behaviour matters.
- **What didn't go well** - specific and behaviour-based, never person-blaming.
- **Where we got lucky** - the near-miss list. The most valuable section, and the one most teams skip.
- **Action items** with owners, due dates, and a "will this actually prevent recurrence?" check.
- **External-facing version** - same incident, different framing, redact-safe.

## Clarifying questions I will ask you

1. **What was the customer-visible effect, in one sentence?**
2. **When did it start, when was it detected, when was it resolved?** (Three separate times.)
3. **How did we detect it?** (Monitoring / user report / chance / external party)
4. **Who was involved in responding, and what role did each play?** (Commander, comms, scribe, SMEs, exec sponsor)
5. **What was your initial hypothesis vs what turned out to be the actual cause?**
6. **What specifically fixed it?** (The action that moved state from broken to working, not "a restart" without detail.)
7. **What's one thing that, if it had been even slightly different, would have made this much worse?** (Flushes out the luck factor.)
8. **What are you most worried will happen if you don't address this?**
9. **Has this type of incident happened before?** (Pattern detection.)
10. **Who will actually own each action item?** (Named individuals, not teams.)
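The three timestamps from question 2 are worth capturing precisely, because they yield the lag metrics the write-up should surface. A minimal sketch (the function and field names are illustrative, not part of the skill):

```python
from datetime import datetime, timezone

def incident_lags(started, detected, resolved, fmt="%Y-%m-%dT%H:%M:%SZ"):
    """Compute time-to-detect and time-to-resolve (in minutes) from the
    three UTC timestamps asked for in question 2."""
    t0, t1, t2 = (datetime.strptime(t, fmt).replace(tzinfo=timezone.utc)
                  for t in (started, detected, resolved))
    return {
        "time_to_detect_min": (t1 - t0).total_seconds() / 60,
        "time_to_resolve_min": (t2 - t1).total_seconds() / 60,
        "total_duration_min": (t2 - t0).total_seconds() / 60,
    }

lags = incident_lags("2024-03-07T03:00:00Z",  # fault introduced
                     "2024-03-07T03:12:00Z",  # first alert / report
                     "2024-03-07T04:30:00Z")  # service restored
# {'time_to_detect_min': 12.0, 'time_to_resolve_min': 78.0, 'total_duration_min': 90.0}
```

A large time-to-detect relative to time-to-resolve usually points the five whys at monitoring rather than at the fix itself.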

## Output template

```markdown
# Post-Mortem: <short descriptor> - YYYY-MM-DD

**Incident ID:** INC-YYYY-NNNN
**Severity:** P1 / P2 / P3 / P4
**Customer impact:** <service name> unavailable for <duration> / degraded for <duration>
**Status:** Resolved
**Next review at:** <date - usually 3 months after>
**Classification:** Internal / External-shareable / Redacted external

## 1. Summary
> <3-4 sentences. Readable by a non-engineer. Include what broke, who was affected, for how long, what fixed it, and whether recurrence risk is high/medium/low.>

## 2. Impact
- **Services affected:** <list>
- **Customer segments:** <segment, count if known>
- **Geographic scope:** <region / global>
- **Duration:** Start <UTC> - Detected <UTC> - Resolved <UTC>
- **Revenue / SLA impact:** <estimate or "under review">
- **Data integrity / confidentiality:** <none / see section 5>

## 3. Timeline
All times in UTC. Italic rows = events we learned about during the investigation.

| Time | Who/what | Event |
|---|---|---|
| HH:MM | <system / person> | <what happened, observably> |
| HH:MM | <system / person> | <what happened, observably> |
| HH:MM | <system / person> | <detection> |
| HH:MM | <IC> | <decision made, with what information> |
| HH:MM | <engineer> | <action taken> |
| HH:MM | <system> | <service restored> |

**What we didn't know at the time (and wish we had):**
- <...>
- <...>

## 4. Root Cause Analysis

### The one-sentence cause
<Plain English. No hedging. If we're not 100% sure, say "most likely cause" and describe the evidence.>

### Five whys
1. Why did the service fail? <symptom>
2. Why? <deeper>
3. Why? <deeper>
4. Why? <deeper>
5. Why? <root cause - stop when further whys are unactionable>

### Contributing factors (not the root cause, but made it worse or slower to detect)
- **<factor>** - <how it contributed>
- **<factor>** - <how it contributed>
- **<factor>** - <how it contributed>

## 5. Data impact
- **Data lost:** None / <description>
- **Data exposed:** None / <description>
- **Personal data involved:** Yes / No - if yes, <categories, volume, jurisdiction>
- **Regulatory notification:** Required / Not required - <regulator, deadline>

## 6. What went well
Concrete, behaviour-based. Not "the team worked hard."
- <e.g. "Detection alert fired within 45 seconds of first error.">
- <e.g. "On-call engineer correctly escalated to IC within 5 minutes.">
- <e.g. "Comms Lead sent customer update within 20 minutes of detection, beating our 30-minute SLA.">

## 7. What didn't go well
Behaviour-based, not person-based. "Monitoring gap" not "Alex should have caught it."
- <e.g. "Alert threshold was set at 5% error rate but customer impact begins at 1%.">
- <e.g. "Runbook for this failure mode did not exist, so the engineer reinvented the sequence live.">
- <e.g. "Rollback took 40 minutes because we hadn't staged the previous build as a hot standby.">

## 8. Where we got lucky
The near-miss list. Read carefully - these are free warnings.
- <e.g. "The failure happened at 03:00 UTC. If it had happened during EU business hours, revenue impact would have been ~8x.">
- <e.g. "Primary on-call was on holiday. The secondary happened to be awake and online for a different reason.">
- <e.g. "The bad config had been queued to push to all 200 devices. We only pushed to 10 before aborting.">

## 9. Action items
Each action must answer: what, who, by when, and "would this prevent recurrence?"

| # | Action | Owner | Due | Prevents recurrence? | Status |
|---|---|---|---|---|---|
| A1 | <specific, testable> | <named person> | YYYY-MM-DD | Yes / Partially / Detects-only | Open |
| A2 | <specific, testable> | <named person> | YYYY-MM-DD | Yes / Partially / Detects-only | Open |
| A3 | <specific, testable> | <named person> | YYYY-MM-DD | Yes / Partially / Detects-only | Open |

### Actions explicitly rejected (and why)
- "<idea raised in retro>" - rejected because <reason>. Revisit if <trigger>.

## 10. Supporting evidence
- Graphs / dashboards: <links, described if private>
- Related tickets: <IDs>
- Vendor case numbers: <IDs>
- Related previous incidents: <IDs + one-line tie-in>

## 11. External-facing version
Draft for status page / customer email / blog, with internal details redacted.

> Between <start UTC> and <end UTC>, some customers experienced <symptom>. We identified the cause as <plain language, no internal system names>. Service was restored at <end UTC>. We apologise for the disruption. The steps we are taking to prevent recurrence are: <2-3 bullets>. If you were affected and have questions, contact <support email>.

### Redaction checklist (before publishing externally)
- [ ] No internal system / service names
- [ ] No personal names
- [ ] No security vulnerability detail that could be weaponised
- [ ] No vendor blame that is not publicly verifiable
- [ ] Legal / PR review if impact was significant
```
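If you keep timeline entries as structured data (for example, exported from Slack before the retro), the section 3 table can be rendered mechanically and stays consistent across post-mortems. A hedged sketch, assuming a simple list-of-dicts shape of your own choosing:

```python
def render_timeline(entries):
    """Render timeline entries as the markdown table used in section 3.
    Each entry is a dict: {"time": "HH:MM", "who": ..., "event": ...}."""
    rows = ["| Time | Who/what | Event |", "|---|---|---|"]
    rows += [f"| {e['time']} | {e['who']} | {e['event']} |" for e in entries]
    return "\n".join(rows)

table = render_timeline([
    {"time": "03:00", "who": "idp-gateway", "event": "SAML cert expires; logins start failing"},
    {"time": "03:12", "who": "alertmanager", "event": "Login error-rate alert fires"},
])
print(table)
```

Keeping the source data structured also makes it easy to sort by time and spot gaps where "unknown" entries belong.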

## Example invocation

**User:** "/post-mortem-facilitator - we had a 90-minute outage last Thursday where logins broke for staff. Turned out the SAML cert had expired. Nobody noticed because the expiry reminder went to a shared mailbox nobody reads."

**What the skill will do:**
1. Ask for the start/detection/resolution times, who was on the bridge, what actually fixed it (cert reissue + metadata re-upload?), the blast radius (staff-only or customers too?), and, crucially, the "where we got lucky" angle (what if it had been the customer-facing IdP instead?).
2. Keep drilling on "why" - cert expired (symptom) -> reminder email ignored (miss) -> reminder went to shared inbox (miss) -> no process owner for that inbox (root) -> no automated monitoring of cert expiry (control gap).
3. Produce a post-mortem with action items like "add expiry monitoring for SAML signing certs, TLS endpoints, and the internal CA" with named owners, and flag the near-miss (the customer-facing IdP uses the same broken renewal process - high-priority follow-up).

## Notes for the requester

- **"Blameless" doesn't mean "no accountability."** Actions still have owners. It means the write-up describes systems and decisions, not individuals' failings. People make mistakes; the systems around them should catch those mistakes.
- **If you don't know something, say "unknown."** "Unknown" is a valid timeline entry. Speculation ("the engineer probably did X") corrodes the value of the document.
- **Timelines from Slack are usually wrong by 2-3 minutes.** Cross-reference with the monitoring tool's timestamps if accuracy matters.
- **Action items must be testable.** "Improve monitoring" is not an action. "Add Prometheus alert for certificate expiry <14 days, page on-call" is an action.
- **"Where we got lucky" is the most valuable section.** Spend as much time here as you do on root cause. Near-misses are where the next P1 lives.
- **"Good" looks like:** someone who wasn't on the incident reads it, understands exactly what happened and why, and can identify whether their area has the same latent risk.
