---
name: incident-responder
description: Walks an IT manager through a live or recent incident and produces a timeline, root cause analysis, and post-incident review ready for distribution.
version: 1.0.0
author: VantagePoint Networks
audience: IT Managers, Service Delivery Managers, Incident Commanders, Problem Managers
output_format: Formatted Markdown incident report with timeline, 5-whys RCA, stakeholder update, and remediation plan.
license: MIT
---

# Incident Responder

A structured conversation that produces a defensible post-incident record without forcing you to learn ITIL v4 or NIST SP 800-61 by heart.

## How to use this skill

1. Download this `SKILL.md` file.
2. Save it as `incident-responder.md` in `~/.claude/commands/` (macOS/Linux) or `%USERPROFILE%\.claude\commands\` (Windows) - Claude Code names the slash command after the file.
3. Run `/incident-responder` in Claude Code. Describe the incident - you can paste ticket notes, Slack threads, or just tell the story.
4. Answer the questions as they come. The skill will produce a full report you can share with leadership, customers, or regulators.

## When to use this

- **Mid-incident:** you need to structure the bridge call and keep a running timeline. Paste updates as they come in.
- **Immediately after resolution:** you have 4 hours to send a stakeholder update and do not want to miss anything.
- **Next morning:** a formal PIR is due to the business or a regulator, and all you have are Slack threads, ticket notes, and fragmented memory.
- **Post-mortem meeting prep:** you want a draft for the team to refine, not a blank document.
- **Regulator or contractual notification:** you need a fact-only chronology without speculation.

## What you'll get

- **Executive summary** (3-4 sentences, no jargon) suitable for the CEO or a customer.
- **Impact statement** with affected services, customers, duration, and financial/reputational estimate.
- **Timeline** of events, detections, actions, and decisions (UTC + local time columns).
- **Five-whys root cause analysis** with contributing factors called out separately.
- **Remediation actions** - immediate, short-term, long-term - with owners and due dates.
- **Stakeholder communications pack** - internal all-hands, customer-facing update, and executive one-pager.
- **Lessons learned** section suitable for retrospective and for adding to the runbook library.
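The timeline's dual time columns are easiest to fill if you convert each local timestamp once and paste both values. A minimal Python sketch of that conversion (the function name and sample zone are illustrative, not part of the skill):

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # stdlib, Python 3.9+

def to_utc_row(local_iso: str, tz_name: str) -> tuple[str, str]:
    """Return (local HH:MM, UTC HH:MM) for one timeline row."""
    local = datetime.fromisoformat(local_iso).replace(tzinfo=ZoneInfo(tz_name))
    return local.strftime("%H:%M"), local.astimezone(ZoneInfo("UTC")).strftime("%H:%M")

# During US daylight saving, 14:05 in New York is 18:05 UTC:
print(to_utc_row("2024-03-14T14:05", "America/New_York"))  # ('14:05', '18:05')
```

Using `zoneinfo` rather than fixed offsets matters here: incidents that straddle a DST changeover will otherwise get the wrong UTC column.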

## Clarifying questions I will ask you

1. **Is this incident active right now, or is it over?** (This changes the tone, the urgency, and whether I propose containment actions.)
2. **In one line, what was the customer-visible symptom?** ("Users couldn't log in", "Checkout failed", "Site was unreachable from outside")
3. **When was it first detected, and how?** (Monitoring alert, user report, third-party notification, security finding)
4. **When did it actually start?** (Often earlier than detection - look at logs/graphs.)
5. **What was the resolution action, and what time did normal service return?**
6. **What services, customer segments, and regions were affected?**
7. **Did data get lost, corrupted, or exposed?** (Drives regulator/breach track.)
8. **Who responded, and what role did each person play?** (Commander, scribe, comms, SME)
9. **What is your best current hypothesis for the root cause?**
10. **Any dependencies you already know contributed?** (Cloud provider, ISP, third-party SaaS)
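For question 4, the fastest way to find the actual start is to scan the logs for the first error before detection. A rough sketch, assuming a hypothetical timestamp-first log format (adjust the pattern to whatever your logging stack emits):

```python
import re

# Assumed log shape (hypothetical): "2024-03-14T13:48:02Z ERROR payment-webhook upstream timeout"
ERROR_LINE = re.compile(r"^(\S+Z)\s+ERROR\b")

def first_error_timestamp(lines):
    """Return the timestamp of the earliest ERROR line (lines assumed chronological)."""
    for line in lines:
        m = ERROR_LINE.match(line)
        if m:
            return m.group(1)
    return None

log = [
    "2024-03-14T13:40:00Z INFO payment-webhook healthy",
    "2024-03-14T13:48:02Z ERROR payment-webhook upstream timeout",
    "2024-03-14T14:05:11Z ERROR payment-webhook upstream timeout",
]
print(first_error_timestamp(log))  # 2024-03-14T13:48:02Z
```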

## Output template

```markdown
# Incident Report: INC-YYYY-NNNN <short descriptor>

## 1. Executive Summary
> <3-4 sentences, readable by a non-technical executive. Covers: what broke, who was affected, for how long, what we did, whether it can happen again.>

## 2. At a Glance
| Field | Value |
|---|---|
| Incident ID | INC-YYYY-NNNN |
| Severity | P1 / P2 / P3 / P4 |
| Status | Resolved / Monitoring / Active |
| Detected | YYYY-MM-DD HH:MM UTC (via <source>) |
| Started (actual) | YYYY-MM-DD HH:MM UTC |
| Resolved | YYYY-MM-DD HH:MM UTC |
| Duration | <hh:mm> |
| Impact | <services + users + region> |
| Data loss? | None / Yes (see section 7) |
| Data exposure? | None / Yes (see section 7) |

## 3. Impact
- **Services affected:** <list>
- **Customer segments affected:** <segment, count if known>
- **Geographic scope:** <region / global>
- **User-facing symptom:** <plain language>
- **Estimated revenue impact:** <figure or "under assessment">
- **SLA breach?** Yes / No / Pending calculation
- **Contractual notification required?** Yes / No - to <party> by <time>

## 4. Timeline
All times in <TZ>. Most recent event at the bottom.

| Time (local) | Time (UTC) | Who | Event |
|---|---|---|---|
| HH:MM | HH:MM | <system / person> | <observation, action, or decision> |
| HH:MM | HH:MM | <system / person> | <observation, action, or decision> |
| HH:MM | HH:MM | <system / person> | <observation, action, or decision> |

## 5. Response Roles
| Role | Name | Notes |
|---|---|---|
| Incident Commander | | Ran the bridge, directed the response |
| Technical Lead | | Drove diagnosis and fix |
| Communications Lead | | Wrote and sent stakeholder updates |
| Scribe | | Maintained the timeline live |
| Executive Sponsor | | Approved customer-facing statements |

## 6. Root Cause Analysis - Five Whys
1. **Why did the service fail?** <technical symptom>
2. **Why?** <deeper cause>
3. **Why?** <deeper cause>
4. **Why?** <deeper cause>
5. **Why?** <root cause>

### Contributing factors (not the root cause, but made it worse or slowed recovery)
- <factor 1>
- <factor 2>
- <factor 3>

### What would have prevented or detected this sooner?
- <control 1>
- <control 2>

## 7. Data Impact Assessment
- **Data lost:** None / <description + quantity>
- **Data exposed:** None / <description + who could see it + is access logged>
- **Personal data involved?** Yes / No - if yes: <categories, number of subjects, jurisdiction>
- **Regulatory notification triggered?** Yes / No - <regulator> by <deadline>
- **Backup / recovery action taken:** <yes/no/partial>

## 8. Remediation
### Immediate (within 24 hours)
| Action | Owner | Due |
|---|---|---|
| <action> | <owner> | <date> |

### Short-term (within 2 weeks)
| Action | Owner | Due |
|---|---|---|
| <action> | <owner> | <date> |

### Long-term (within 1 quarter)
| Action | Owner | Due |
|---|---|---|
| <action> | <owner> | <date> |

## 9. Communications Pack
### Internal all-hands update (send in Slack / Teams)
> We had a P<N> incident today between <start> and <end>. Customers experienced <symptom>. Root cause was <plain language>. <Team> fixed it by <action>. Full write-up will be circulated within <N> working days.

### Customer-facing update (status page / email)
> Between <start UTC> and <end UTC>, some customers were unable to <action>. We identified the cause as <plain language, no internal system names>. Service was fully restored at <end UTC>. We apologise for the disruption and are taking the following steps to prevent recurrence: <1-2 bullets>. If you believe you were affected and have questions, contact <support email>.

### Executive one-pager
> **What happened:** <1 sentence>
> **Impact:** <1 sentence with numbers>
> **Root cause:** <1 sentence>
> **Fix:** <1 sentence>
> **Recurrence risk:** <High/Medium/Low> - <1 sentence explaining why>
> **Next steps:** <3 bullets>

## 10. Lessons Learned
- **What went well:** <bullet list>
- **What didn't go well:** <bullet list>
- **Where we got lucky:** <bullet list - things that could have made it worse but didn't>
- **Runbook / monitoring / process updates required:** <bullet list linked to remediation owners>

## 11. Appendix
- Relevant graphs / dashboards: <links>
- Ticket references: <IDs>
- Vendor case numbers: <IDs>
- Related previous incidents: <IDs>
```
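If you'd rather compute the "Duration" field than eyeball it, a small sketch that assumes the template's `YYYY-MM-DD HH:MM UTC` timestamp strings (the function name is illustrative):

```python
from datetime import datetime

def duration_hhmm(started: str, resolved: str) -> str:
    """Format outage duration as hh:mm from 'YYYY-MM-DD HH:MM UTC' strings."""
    def _parse(s):
        return datetime.strptime(s.removesuffix(" UTC"), "%Y-%m-%d %H:%M")
    minutes = int((_parse(resolved) - _parse(started)).total_seconds() // 60)
    return f"{minutes // 60:02d}:{minutes % 60:02d}"

print(duration_hhmm("2024-03-14 13:48 UTC", "2024-03-14 14:28 UTC"))  # 00:40
```

Compute it from "Started (actual)", not "Detected" - the gap between the two is itself a lesson-learned candidate.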

## Example invocation

**User:** "We had a 40-minute outage this afternoon where customers couldn't check out. Stripe webhook stopped coming through. Fixed by restarting our webhook handler service."

**What the skill will do:**
1. Ask clarifying questions to pin down start vs detection time, blast radius (all customers or a subset), whether refunds or duplicate charges happened, and who was on the bridge.
2. Note that "restart fixed it" is a symptom, not a root cause, and keep asking "why" until there's a plausible deeper cause (queue backing up? memory leak? cert rotation?).
3. Produce the full report with a customer-facing update that can go on the status page within minutes, and a remediation list that distinguishes "restart more aggressively" from "find the memory leak."

## Notes for the requester

- **Keep talking even if you don't know.** "Unknown" is a valid answer. The skill will surface gaps rather than guess.
- **Paste raw.** Slack threads, ticket comments, vendor RFC responses - paste them all. The skill will extract the timeline.
- **Don't assign blame.** Ever. The output is fact-only. People issues belong in a separate retrospective, not in a report that may be shared externally.
- **If a regulator is involved**, tell the skill at the start. It will flag where you need to be extra precise and will not soften language in a way that could create legal risk.
- **"Good" looks like:** a director can read sections 1, 2, and 3 in 60 seconds and understand the situation. A new engineer can read sections 4, 6, and 8 and learn from it.
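If you want a head start on the timeline before pasting a raw Slack thread, a rough sketch of the extraction (the `[HH:MM] user: message` shape is an assumption - adjust the pattern to your export format):

```python
import re

# Assumed paste shape (one message per line): "[14:12] alice: restarted webhook handler"
SLACK_LINE = re.compile(r"^\[(\d{1,2}:\d{2})\]\s+([\w.-]+):\s+(.*)$")

def extract_rows(pasted: str):
    """Pull (time, who, event) tuples out of a pasted Slack-style thread,
    skipping lines that don't match."""
    return [m.groups() for line in pasted.splitlines()
            if (m := SLACK_LINE.match(line.strip()))]

thread = """[14:12] alice: checkout errors spiking
noise line without a timestamp
[14:19] bob: restarted webhook handler"""
print(extract_rows(thread))
# [('14:12', 'alice', 'checkout errors spiking'), ('14:19', 'bob', 'restarted webhook handler')]
```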
