---
name: post-mortem-facilitator
description: Turns a messy Slack thread and partial timeline into a blameless post-mortem with 5-whys, action items, and an honest "what we got lucky on" section.
version: 1.0.0
author: VantagePoint Networks
audience: IT Managers, SREs, Incident Commanders, Engineering Leads
output_format: Formatted Markdown blameless post-mortem, ready to share with the team, leadership, or publish externally (after redaction).
license: MIT
---

# Post-Mortem Facilitator

A conversational walkthrough that converts the scattered evidence of a handled incident into a blameless post-mortem the team will actually read and learn from.

## How to use this skill

1. Download this `SKILL.md` file.
2. Save it as `post-mortem-facilitator.md` in `~/.claude/commands/` (macOS/Linux) or `%USERPROFILE%\.claude\commands\` (Windows) - the slash command takes its name from the filename.
3. Run `/post-mortem-facilitator` in Claude Code. Paste whatever you have - the Slack or Teams thread, a rough timeline, ticket comments, graph screenshots described in words. The skill will interview you and produce the write-up.

## When to use this

- An incident has been resolved but the write-up has been sitting in "someone's to-do" for two weeks.
- You're new to running post-mortems and want a structure that won't turn into a blame session.
- You want to publish an external version for customers and need a redact-safe draft alongside the internal one.
- You're building a library of post-mortems and want every one to follow the same shape.
- A senior leader has asked "what did we learn from X?" and you need a defensible answer in an hour.

## What you'll get

- **Summary** (3-4 sentences) for anyone who will only read the first page.
- **Timeline** in a structured table: time, event, who was involved, what was known at the time.
- **Root cause + contributing factors** - split, because they're not the same thing.
- **Five whys** that stop only when "further whys" become unactionable.
- **What went well** - not as a softener, but because reinforcing good behaviour matters.
- **What didn't go well** - specific and behaviour-based, never person-blaming.
- **Where we got lucky** - the near-miss list. The most valuable section, and the one most teams skip.
- **Action items** with owners, due dates, and a "will this actually prevent recurrence?" check.
- **External-facing version** - same incident, different framing, redact-safe.

## Clarifying questions I will ask you

1. **What was the customer-visible effect, in one sentence?**
2. **When did it start, when was it detected, when was it resolved?** (Three separate times.)
3. **How did we detect it?** (Monitoring / user report / chance / external party)
4. **Who was involved in responding, and what role did each play?** (Commander, comms, scribe, SMEs, exec sponsor)
5. **What was your initial hypothesis vs what turned out to be the actual cause?**
6. **What specifically fixed it?** (The action that moved state from broken to working, not "a restart" without detail.)
7. **What's one thing that, if it had been even slightly different, would have made this much worse?** (Flushes out the luck factor.)
8. **What are you most worried will happen if you don't address this?**
9. **Has this type of incident happened before?** (Pattern detection.)
10. **Who will actually own each action item?** (Named individuals, not teams.)
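The three timestamps from question 2 are worth capturing precisely, because they yield the lag metrics the write-up should surface. A minimal sketch (the function and field names are illustrative, not part of the skill):

```python
from datetime import datetime, timezone

def incident_lags(started, detected, resolved, fmt="%Y-%m-%dT%H:%M:%SZ"):
    """Compute time-to-detect and time-to-resolve (in minutes) from the
    three UTC timestamps asked for in question 2."""
    t0, t1, t2 = (datetime.strptime(t, fmt).replace(tzinfo=timezone.utc)
                  for t in (started, detected, resolved))
    return {
        "time_to_detect_min": (t1 - t0).total_seconds() / 60,
        "time_to_resolve_min": (t2 - t1).total_seconds() / 60,
        "total_duration_min": (t2 - t0).total_seconds() / 60,
    }

lags = incident_lags("2024-03-07T03:00:00Z",  # fault introduced
                     "2024-03-07T03:12:00Z",  # first alert / report
                     "2024-03-07T04:30:00Z")  # service restored
# {'time_to_detect_min': 12.0, 'time_to_resolve_min': 78.0, 'total_duration_min': 90.0}
```

A large time-to-detect relative to time-to-resolve usually points the five whys at monitoring rather than at the fix itself.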

## Output template

```markdown
# Post-Mortem: <short descriptor> - YYYY-MM-DD

**Incident ID:** INC-YYYY-NNNN
**Severity:** P1 / P2 / P3 / P4
**Customer impact:** <service name> unavailable for <duration> / degraded for <duration>
**Status:** Resolved
**Next review at:** <date - usually 3 months after>
**Classification:** Internal / External-shareable / Redacted external

## 1. Summary
> <3-4 sentences. Readable by a non-engineer. Include what broke, who was affected, for how long, what fixed it, and whether recurrence risk is high/medium/low.>

## 2. Impact
- **Services affected:** <list>
- **Customer segments:** <segment, count if known>
- **Geographic scope:** <region / global>
- **Duration:** Start <UTC> - Detected <UTC> - Resolved <UTC>
- **Revenue / SLA impact:** <estimate or "under review">
- **Data integrity / confidentiality:** <none / see section 5>

## 3. Timeline
All times in UTC. Italic rows = events we learned about during the investigation.

| Time | Who/what | Event |
|---|---|---|
| HH:MM | <system / person> | <what happened, observably> |
| HH:MM | <system / person> | <what happened, observably> |
| HH:MM | <system / person> | <detection> |
| HH:MM | <IC> | <decision made, with what information> |
| HH:MM | <engineer> | <action taken> |
| HH:MM | <system> | <service restored> |

**What we didn't know at the time (and wish we had):**
- <...>
- <...>

## 4. Root Cause Analysis

### The one-sentence cause
<Plain English. No hedging. If we're not 100% sure, say "most likely cause" and describe the evidence.>

### Five whys
1. Why did the service fail? <symptom>
2. Why? <deeper>
3. Why? <deeper>
4. Why? <deeper>
5. Why? <root cause - stop when further whys are unactionable>

### Contributing factors (not the root cause, but made it worse or slower to detect)
- **<factor>** - <how it contributed>
- **<factor>** - <how it contributed>
- **<factor>** - <how it contributed>

## 5. Data impact
- **Data lost:** None / <description>
- **Data exposed:** None / <description>
- **Personal data involved:** Yes / No - if yes, <categories, volume, jurisdiction>
- **Regulatory notification:** Required / Not required - <regulator, deadline>

## 6. What went well
Concrete, behaviour-based. Not "the team worked hard."
- <e.g. "Detection alert fired within 45 seconds of first error.">
- <e.g. "On-call engineer correctly escalated to IC within 5 minutes.">
- <e.g. "Comms Lead sent customer update within 20 minutes of detection, beating our 30-minute SLA.">

## 7. What didn't go well
Behaviour-based, not person-based. "Monitoring gap" not "Alex should have caught it."
- <e.g. "Alert threshold was set at 5% error rate but customer impact begins at 1%.">
- <e.g. "Runbook for this failure mode did not exist, so the engineer reinvented the sequence live.">
- <e.g. "Rollback took 40 minutes because we hadn't staged the previous build as a hot standby.">

## 8. Where we got lucky
The near-miss list. Read carefully - these are free warnings.
- <e.g. "The failure happened at 03:00 UTC. If it had happened during EU business hours, revenue impact would have been ~8x.">
- <e.g. "Primary on-call was on holiday. The secondary happened to be awake and online for a different reason.">
- <e.g. "The bad config had been queued to push to all 200 devices. We only pushed to 10 before aborting.">

## 9. Action items
Each action must answer: what, who, by when, and "would this prevent recurrence?"

| # | Action | Owner | Due | Prevents recurrence? | Status |
|---|---|---|---|---|---|
| A1 | <specific, testable> | <named person> | YYYY-MM-DD | Yes / Partially / Detects-only | Open |
| A2 | <specific, testable> | <named person> | YYYY-MM-DD | Yes / Partially / Detects-only | Open |
| A3 | <specific, testable> | <named person> | YYYY-MM-DD | Yes / Partially / Detects-only | Open |

### Actions explicitly rejected (and why)
- "<idea raised in retro>" - rejected because <reason>. Revisit if <trigger>.

## 10. Supporting evidence
- Graphs / dashboards: <links, described if private>
- Related tickets: <IDs>
- Vendor case numbers: <IDs>
- Related previous incidents: <IDs + one-line tie-in>

## 11. External-facing version
Draft for status page / customer email / blog, with internal details redacted.

> Between <start UTC> and <end UTC>, some customers experienced <symptom>. We identified the cause as <plain language, no internal system names>. Service was restored at <end UTC>. We apologise for the disruption. The steps we are taking to prevent recurrence are: <2-3 bullets>. If you were affected and have questions, contact <support email>.

### Redaction checklist (before publishing externally)
- [ ] No internal system / service names
- [ ] No personal names
- [ ] No security vulnerability detail that could be weaponised
- [ ] No vendor blame that is not publicly verifiable
- [ ] Legal / PR review if impact was significant
```
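If you keep timeline entries as structured data (for example, exported from Slack before the retro), the section 3 table can be rendered mechanically and stays consistent across post-mortems. A hedged sketch, assuming a simple list-of-dicts shape of your own choosing:

```python
def render_timeline(entries):
    """Render timeline entries as the markdown table used in section 3.
    Each entry is a dict: {"time": "HH:MM", "who": ..., "event": ...}."""
    rows = ["| Time | Who/what | Event |", "|---|---|---|"]
    rows += [f"| {e['time']} | {e['who']} | {e['event']} |" for e in entries]
    return "\n".join(rows)

table = render_timeline([
    {"time": "03:00", "who": "idp-gateway", "event": "SAML cert expires; logins start failing"},
    {"time": "03:12", "who": "alertmanager", "event": "Login error-rate alert fires"},
])
print(table)
```

Keeping the source data structured also makes it easy to sort by time and spot gaps where "unknown" entries belong.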

## Example invocation

**User:** "/post-mortem-facilitator - we had a 90-minute outage last Thursday where logins broke for staff. Turned out the SAML cert had expired. Nobody noticed because the expiry reminder went to a shared mailbox nobody reads."

**What the skill will do:**
1. Ask for the start/detection/resolution times, who was on the bridge, what actually fixed it (cert reissue + metadata re-upload?), the blast radius (staff-only or customers too?), and, crucially, the "where we got lucky" angle (what if it had been the customer-facing IdP instead?).
2. Keep drilling on "why" - cert expired (symptom) -> reminder email ignored (miss) -> reminder went to shared inbox (miss) -> no process owner for that inbox (root) -> no automated monitoring of cert expiry (control gap).
3. Produce a post-mortem with action items like "add expiry monitoring for SAML signing certs, TLS endpoints, and the internal CA" with named owners, and flag the near-miss (the customer-facing IdP uses the same broken renewal process - high-priority follow-up).

## Notes for the requester

- **"Blameless" doesn't mean "no accountability."** Actions still have owners. It means the write-up describes systems and decisions, not individuals' failings. People make mistakes; the systems around them should catch those mistakes.
- **If you don't know something, say "unknown."** "Unknown" is a valid timeline entry. Speculation ("the engineer probably did X") corrodes the value of the document.
- **Timelines from Slack are usually wrong by 2-3 minutes.** Cross-reference with the monitoring tool's timestamps if accuracy matters.
- **Action items must be testable.** "Improve monitoring" is not an action. "Add Prometheus alert for certificate expiry <14 days, page on-call" is an action.
- **"Where we got lucky" is the most valuable section.** Spend as much time here as you do on root cause. Near-misses are where the next P1 lives.
- **"Good" looks like:** someone who wasn't on the incident reads it, understands exactly what happened and why, and can identify whether their area has the same latent risk.
