---
name: incident-responder
description: Walks an IT manager through a live or recent incident and produces a timeline, root cause analysis, and post-incident review ready for distribution.
version: 1.0.0
author: VantagePoint Networks
audience: IT Managers, Service Delivery Managers, Incident Commanders, Problem Managers
output_format: Formatted Markdown incident report with timeline, 5-whys RCA, stakeholder update, and remediation plan.
license: MIT
---

# Incident Responder

A structured conversation that produces a defensible post-incident record without forcing you to learn ITIL v4 or NIST SP 800-61 by heart.

## How to use this skill

1. Download this `SKILL.md` file.
2. Save it as `incident-responder.md` in `~/.claude/commands/` (macOS/Linux) or `%USERPROFILE%\.claude\commands\` (Windows) - Claude Code names the slash command after the file.
3. Run `/incident-responder` in Claude Code. Describe the incident - you can paste ticket notes, Slack threads, or just tell the story.
4. Answer the questions as they come. The skill will produce a full report you can share with leadership, customers, or regulators.

## When to use this

- **Mid-incident:** you need to structure the bridge call and keep a running timeline. Paste updates as they come in.
- **Immediately after resolution:** you have 4 hours to send a stakeholder update and do not want to miss anything.
- **Next morning:** a formal PIR is due to the business or a regulator, and all you have are Slack threads, ticket notes, and fragmented memory.
- **Post-mortem meeting prep:** you want a draft for the team to refine, not a blank document.
- **Regulator or contractual notification:** you need a fact-only chronology without speculation.

## What you'll get

- **Executive summary** (3-4 sentences, no jargon) suitable for the CEO or a customer.
- **Impact statement** with affected services, customers, duration, and financial/reputational estimate.
- **Timeline** of events, detections, actions, and decisions (UTC + local time columns).
- **Five-whys root cause analysis** with contributing factors called out separately.
- **Remediation actions** - immediate, short-term, long-term - with owners and due dates.
- **Stakeholder communications pack** - internal all-hands, customer-facing update, and executive one-pager.
- **Lessons learned** section suitable for retrospective and for adding to the runbook library.
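The timeline's dual time columns are easiest to fill if you convert each local timestamp once and paste both values. A minimal Python sketch of that conversion (the function name and sample zone are illustrative, not part of the skill):

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # stdlib, Python 3.9+

def to_utc_row(local_iso: str, tz_name: str) -> tuple[str, str]:
    """Return (local HH:MM, UTC HH:MM) for one timeline row."""
    local = datetime.fromisoformat(local_iso).replace(tzinfo=ZoneInfo(tz_name))
    return local.strftime("%H:%M"), local.astimezone(ZoneInfo("UTC")).strftime("%H:%M")

# During US daylight saving, 14:05 in New York is 18:05 UTC:
print(to_utc_row("2024-03-14T14:05", "America/New_York"))  # ('14:05', '18:05')
```

Using `zoneinfo` rather than fixed offsets matters here: incidents that straddle a DST changeover will otherwise get the wrong UTC column.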

## Clarifying questions I will ask you

1. **Is this incident active right now, or is it over?** (This changes the tone, the urgency, and whether I propose containment actions.)
2. **In one line, what was the customer-visible symptom?** ("Users couldn't log in", "Checkout failed", "Site was unreachable from outside")
3. **When was it first detected, and how?** (Monitoring alert, user report, third-party notification, security finding)
4. **When did it actually start?** (Often earlier than detection - look at logs/graphs.)
5. **What was the resolution action, and what time did normal service return?**
6. **What services, customer segments, and regions were affected?**
7. **Did data get lost, corrupted, or exposed?** (Drives regulator/breach track.)
8. **Who responded, and what role did each person play?** (Commander, scribe, comms, SME)
9. **What is your best current hypothesis for the root cause?**
10. **Any dependencies you already know contributed?** (Cloud provider, ISP, third-party SaaS)
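For question 4, the fastest way to find the actual start is to scan the logs for the first error before detection. A rough sketch, assuming a hypothetical timestamp-first log format (adjust the pattern to whatever your logging stack emits):

```python
import re

# Assumed log shape (hypothetical): "2024-03-14T13:48:02Z ERROR payment-webhook upstream timeout"
ERROR_LINE = re.compile(r"^(\S+Z)\s+ERROR\b")

def first_error_timestamp(lines):
    """Return the timestamp of the earliest ERROR line (lines assumed chronological)."""
    for line in lines:
        m = ERROR_LINE.match(line)
        if m:
            return m.group(1)
    return None

log = [
    "2024-03-14T13:40:00Z INFO payment-webhook healthy",
    "2024-03-14T13:48:02Z ERROR payment-webhook upstream timeout",
    "2024-03-14T14:05:11Z ERROR payment-webhook upstream timeout",
]
print(first_error_timestamp(log))  # 2024-03-14T13:48:02Z
```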

## Output template

```markdown
# Incident Report: INC-YYYY-NNNN <short descriptor>

## 1. Executive Summary
> <3-4 sentences, readable by a non-technical executive. Covers: what broke, who was affected, for how long, what we did, whether it can happen again.>

## 2. At a Glance
| Field | Value |
|---|---|
| Incident ID | INC-YYYY-NNNN |
| Severity | P1 / P2 / P3 / P4 |
| Status | Resolved / Monitoring / Active |
| Detected | YYYY-MM-DD HH:MM UTC (via <source>) |
| Started (actual) | YYYY-MM-DD HH:MM UTC |
| Resolved | YYYY-MM-DD HH:MM UTC |
| Duration | <hh:mm> |
| Impact | <services + users + region> |
| Data loss? | None / Yes (see section 7) |
| Data exposure? | None / Yes (see section 7) |

## 3. Impact
- **Services affected:** <list>
- **Customer segments affected:** <segment, count if known>
- **Geographic scope:** <region / global>
- **User-facing symptom:** <plain language>
- **Estimated revenue impact:** <figure or "under assessment">
- **SLA breach?** Yes / No / Pending calculation
- **Contractual notification required?** Yes / No - to <party> by <time>

## 4. Timeline
All times in <TZ>. Most recent event at the bottom.

| Time (local) | Time (UTC) | Who | Event |
|---|---|---|---|
| HH:MM | HH:MM | <system / person> | <observation, action, or decision> |
| HH:MM | HH:MM | <system / person> | <observation, action, or decision> |
| HH:MM | HH:MM | <system / person> | <observation, action, or decision> |

## 5. Response Roles
| Role | Name | Notes |
|---|---|---|
| Incident Commander | | Ran the bridge, directed the response |
| Technical Lead | | Drove diagnosis and fix |
| Communications Lead | | Wrote and sent stakeholder updates |
| Scribe | | Maintained the timeline live |
| Executive Sponsor | | Approved customer-facing statements |

## 6. Root Cause Analysis - Five Whys
1. **Why did the service fail?** <technical symptom>
2. **Why?** <deeper cause>
3. **Why?** <deeper cause>
4. **Why?** <deeper cause>
5. **Why?** <root cause>

### Contributing factors (not the root cause, but made it worse or slowed recovery)
- <factor 1>
- <factor 2>
- <factor 3>

### What would have prevented or detected this sooner?
- <control 1>
- <control 2>

## 7. Data Impact Assessment
- **Data lost:** None / <description + quantity>
- **Data exposed:** None / <description + who could see it + is access logged>
- **Personal data involved?** Yes / No - if yes: <categories, number of subjects, jurisdiction>
- **Regulatory notification triggered?** Yes / No - <regulator> by <deadline>
- **Backup / recovery action taken:** <yes/no/partial>

## 8. Remediation
### Immediate (within 24 hours)
| Action | Owner | Due |
|---|---|---|
| <action> | <owner> | <date> |

### Short-term (within 2 weeks)
| Action | Owner | Due |
|---|---|---|
| <action> | <owner> | <date> |

### Long-term (within 1 quarter)
| Action | Owner | Due |
|---|---|---|
| <action> | <owner> | <date> |

## 9. Communications Pack
### Internal all-hands update (send in Slack / Teams)
> We had a P<N> incident today between <start> and <end>. Customers experienced <symptom>. Root cause was <plain language>. <Team> fixed it by <action>. Full write-up will be circulated within <N> working days.

### Customer-facing update (status page / email)
> Between <start UTC> and <end UTC>, some customers were unable to <action>. We identified the cause as <plain language, no internal system names>. Service was fully restored at <end UTC>. We apologise for the disruption and are taking the following steps to prevent recurrence: <1-2 bullets>. If you believe you were affected and have questions, contact <support email>.

### Executive one-pager
> **What happened:** <1 sentence>
> **Impact:** <1 sentence with numbers>
> **Root cause:** <1 sentence>
> **Fix:** <1 sentence>
> **Recurrence risk:** <High/Medium/Low> - <1 sentence explaining why>
> **Next steps:** <3 bullets>

## 10. Lessons Learned
- **What went well:** <bullet list>
- **What didn't go well:** <bullet list>
- **Where we got lucky:** <bullet list - things that could have made it worse but didn't>
- **Runbook / monitoring / process updates required:** <bullet list linked to remediation owners>

## 11. Appendix
- Relevant graphs / dashboards: <links>
- Ticket references: <IDs>
- Vendor case numbers: <IDs>
- Related previous incidents: <IDs>
```
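If you'd rather compute the "Duration" field than eyeball it, a small sketch that assumes the template's `YYYY-MM-DD HH:MM UTC` timestamp strings (the function name is illustrative):

```python
from datetime import datetime

def duration_hhmm(started: str, resolved: str) -> str:
    """Format outage duration as hh:mm from 'YYYY-MM-DD HH:MM UTC' strings."""
    def _parse(s):
        return datetime.strptime(s.removesuffix(" UTC"), "%Y-%m-%d %H:%M")
    minutes = int((_parse(resolved) - _parse(started)).total_seconds() // 60)
    return f"{minutes // 60:02d}:{minutes % 60:02d}"

print(duration_hhmm("2024-03-14 13:48 UTC", "2024-03-14 14:28 UTC"))  # 00:40
```

Compute it from "Started (actual)", not "Detected" - the gap between the two is itself a lesson-learned candidate.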

## Example invocation

**User:** "We had a 40-minute outage this afternoon where customers couldn't check out. Stripe webhook stopped coming through. Fixed by restarting our webhook handler service."

**What the skill will do:**
1. Ask clarifying questions to pin down start vs detection time, blast radius (all customers or a subset), whether refunds or duplicate charges happened, and who was on the bridge.
2. Note that "restart fixed it" is a symptom, not a root cause, and keep asking "why" until there's a plausible deeper cause (queue backing up? memory leak? cert rotation?).
3. Produce the full report with a customer-facing update that can go on the status page within minutes, and a remediation list that distinguishes "restart more aggressively" from "find the memory leak."

## Notes for the requester

- **Keep talking even if you don't know.** "Unknown" is a valid answer. The skill will surface gaps rather than guess.
- **Paste raw.** Slack threads, ticket comments, vendor RFC responses - paste them all. The skill will extract the timeline.
- **Don't assign blame.** Ever. The output is fact-only. People issues belong in a separate retrospective, not in a report that may be shared externally.
- **If a regulator is involved**, tell the skill at the start. It will flag where you need to be extra precise and will not soften language in a way that could create legal risk.
- **"Good" looks like:** a director can read sections 1, 2, and 3 in 60 seconds and understand the situation. A new engineer can read sections 4, 6, and 8 and learn from it.
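If you want a head start on the timeline before pasting a raw Slack thread, a rough sketch of the extraction (the `[HH:MM] user: message` shape is an assumption - adjust the pattern to your export format):

```python
import re

# Assumed paste shape (one message per line): "[14:12] alice: restarted webhook handler"
SLACK_LINE = re.compile(r"^\[(\d{1,2}:\d{2})\]\s+([\w.-]+):\s+(.*)$")

def extract_rows(pasted: str):
    """Pull (time, who, event) tuples out of a pasted Slack-style thread,
    skipping lines that don't match."""
    return [m.groups() for line in pasted.splitlines()
            if (m := SLACK_LINE.match(line.strip()))]

thread = """[14:12] alice: checkout errors spiking
noise line without a timestamp
[14:19] bob: restarted webhook handler"""
print(extract_rows(thread))
# [('14:12', 'alice', 'checkout errors spiking'), ('14:19', 'bob', 'restarted webhook handler')]
```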
