---
name: firmware-upgrade-planner
description: Produces a fleet-wide firmware upgrade plan with risk ordering, maintenance windows, rollback, and per-device verification.
version: 1.0.0
author: VantagePoint Networks
audience: IT Managers, Network Engineers, Infrastructure Leads, Change Managers
output_format: Formatted Markdown upgrade plan with waves, per-device procedure, rollback, verification, CAB record, and risk register entries.
license: MIT
---

# Firmware Upgrade Planner

A structured planning conversation that turns "we need to upgrade all the switches this quarter" into a wave-by-wave plan that is safe, auditable, and unlikely to be your worst weekend.

## How to use this skill

1. Download this `SKILL.md` file.
2. Place it in `~/.claude/commands/` (macOS/Linux) or `%USERPROFILE%\.claude\commands\` (Windows).
3. Run `/firmware-upgrade-planner` in Claude Code. Paste your device inventory (or describe it) and the target firmware version. Answer scope questions. Receive the plan.

## When to use this

- You've got a security advisory / CVE mandate to upgrade N devices within X weeks.
- You're doing the quarterly fleet-wide firmware hygiene pass and want a structured plan.
- You're following a vendor EoL pathway and need a sequenced migration, not a single big-bang.
- You're raising a major change to CAB and need a defensible wave / window / risk paper.
- You're handing the plan to an MSP or on-call engineer and need it to be executable by someone other than you.

## What you'll get

- **Fleet snapshot** - current versions, target version, deltas, criticality.
- **Wave plan** - grouped deployment order (lab -> lowest-risk prod -> tier 2 -> tier 1 / critical), with spacing and success gates between waves.
- **Per-device upgrade procedure** - template steps with pre/during/post checks.
- **Rollback plan** - specific to each vendor and upgrade path (some rollbacks are one-way).
- **Maintenance window schedule** - proposed dates, conflicts with freezes and other changes.
- **Verification checklist** - technical tests + business-facing tests per device.
- **Risk register deltas** - new risks introduced, old risks retired by the upgrade.
- **CAB-ready change record** - or, for emergencies, a compressed emergency-change justification.
- **Communications plan** - stakeholder notifications at wave boundaries.

## Clarifying questions I will ask you

1. **How many devices, which vendor, which models?** (More detail = better sequencing.)
2. **What's the current firmware version range across the fleet?**
3. **What's the target version, and why that version specifically?** (Security fix, feature requirement, vendor EoL, internal baseline)
4. **Is the upgrade path direct, or does it require intermediate versions?** (Some paths need 16.x -> 17.x -> 18.x, not direct.)
5. **What's driving the timeline?** (CVE severity, audit deadline, project milestone, EoL expiry)
6. **Do you have a lab device that matches production?** (Changes risk posture.)
7. **What's your rollback capability?** (Dual-image? Single-image with TFTP? Vendor console only?)
8. **What windows are available?** (Weekend nights only? Any business days? Freeze periods to avoid?)
9. **Any devices in a highly-available pair?** (HA failover is your best friend here.)
10. **What tooling will execute the upgrades?** (Manual per-device, Ansible, vendor's NMS, zero-touch)
11. **Who signs off wave gates?** (Each wave requires an explicit go/no-go.)
12. **Any devices with special constraints?** (DC switches during trading hours, hospital switches during clinical hours, etc.)

## Output template

```markdown
# Firmware Upgrade Plan: <vendor / platform> <current> -> <target>

**Plan ID:** FW-UPG-YYYY-NNN
**Drafted:** YYYY-MM-DD
**Owner:** <name>
**Target completion:** YYYY-MM-DD
**Driver:** <CVE / EoL / baseline / feature>

## 1. Executive Summary
> <One paragraph: what, why, how many devices, how long it will take, worst-case rollback cost, residual risk. Answerable in 30 seconds by a director.>

## 2. Fleet Snapshot
| Metric | Value |
|---|---|
| Devices in scope | N |
| Current version spread | <e.g. "17.9.5 (12 devices), 17.12.3 (8 devices)"> |
| Target version | <e.g. "17.15.1a"> |
| Upgrade path | Direct / Via <intermediate> |
| Vendor advisory ref | <URL / ID> |
| Risk rating (inherent) | Low / Medium / High |

## 3. Pre-requisites
- [ ] Backup of every device's running config + startup config, stored off-device
- [ ] Inventory record with serial, model, current version, mgmt IP
- [ ] Support contract active on all devices (required for TAC engagement mid-incident)
- [ ] Rollback image staged / available via <method>
- [ ] Maintenance windows approved
- [ ] Comms drafted for stakeholders at each wave boundary
- [ ] Vendor advisory / release notes reviewed for known issues
- [ ] Device-specific test in lab against known-good reference config

## 4. Waves

### Wave 0 - Lab
| Device(s) | Scheduled | Go/no-go owner |
|---|---|---|
| <lab device> | <date> | <name> |

**Success gate to proceed to Wave 1:**
- [ ] Upgrade completes without error
- [ ] Running config preserved byte-for-byte (diff check)
- [ ] Forwarding, routing protocols, interface states match pre-upgrade
- [ ] Any vendor-specific known issue tested
- [ ] Rollback tested on the lab device and succeeds

### Wave 1 - Lowest-risk production (non-critical access layer, branch offices)
| Device | Role | Location | Window | Rollback window cuts at |
|---|---|---|---|---|
| <hostname> | <role> | <site> | YYYY-MM-DD HH:MM | HH:MM |
| <hostname> | <role> | <site> | YYYY-MM-DD HH:MM | HH:MM |

**Pacing:** <N> devices per window, <N> minutes between each within a window.
**Success gate to proceed to Wave 2:** 72 hours of clean operation on Wave 1 devices; no related incident tickets; monitoring confirms no silent regression.

### Wave 2 - Tier 2 (distribution, non-critical cores)
<same structure>

### Wave 3 - Tier 1 (core, edge, critical HA pairs)
<same structure - usually slowest pacing, 1 device per window, with bridge call>

### Wave 4 - Special / high-constraint devices
<same structure - each device gets its own change ticket and longer window>

## 5. Per-Device Upgrade Procedure (template)
Replace variables in angle brackets with the specific device.

### Pre-upgrade (T-15 minutes)
1. Confirm maintenance window active.
2. Take config backup: `<vendor-specific command>`. Verify backup file size > 0.
3. Capture baseline state:
   - Uptime
   - Interface counts and status
   - Routing protocol neighbour counts
   - BGP / OSPF / STP convergence check
4. Silence monitoring alerts for the device: <how>.
5. Announce on bridge: "Starting upgrade of <hostname>."

### Upgrade (T-0)
1. Verify target image is present: `<command>`.
2. Verify hash / signature of image: `<command>`.
3. Set boot variable / next-boot image: `<command>`.
4. Save config.
5. Reload / install command: `<command>`.
6. Watch console / monitoring for device to return.

### Post-upgrade (T+N)
1. Confirm device reachable: SSH + ping mgmt.
2. Confirm OS version: `<command>`. Expect: <target version>.
3. Compare baseline state to post-upgrade state - must match:
   - Interface counts + status
   - Neighbour counts
   - Forwarding table entry count (within tolerance)
4. Run any vendor-specific smoke test: <command>.
5. Business-facing check: <specific test, e.g. "staff can reach intranet from VLAN 20">.
6. Remove monitoring silence.
7. Announce on bridge: "<hostname> upgrade complete and verified."

## 6. Rollback Plan
Triggered if: <specific failure condition - e.g. "device does not return within 10 minutes", "routing protocol does not re-establish within 5 minutes post-boot", "any P1 business-facing impact">

### Rollback procedure (dual-image devices)
1. Set boot variable back to previous image: `<command>`.
2. Reload: `<command>`.
3. Verify previous version on boot.
4. Verify baseline state matches pre-upgrade.

### Rollback procedure (single-image devices)
1. TFTP / SCP previous image to device: `<command>`.
2. Set boot variable: `<command>`.
3. Reload: `<command>`.
4. Verify.

**Point of no return:** <the moment after which rollback is no longer viable - name it explicitly for each wave>

## 7. Maintenance Windows
Proposed schedule. Aligned to change freeze calendar.

| Window | Date | Time (UTC) | Wave | Devices | Conflicts checked? |
|---|---|---|---|---|---|
| W1 | YYYY-MM-DD | HH:MM-HH:MM | Wave 0 Lab | <lab> | Y |
| W2 | YYYY-MM-DD | HH:MM-HH:MM | Wave 1 | <list> | Y |
| W3 | YYYY-MM-DD | HH:MM-HH:MM | Wave 2 | <list> | Y |
| W4 | YYYY-MM-DD | HH:MM-HH:MM | Wave 3 | <list> | Y |

## 8. Risk Register Deltas
### New risks introduced by upgrading
- <risk> - L x I = N - mitigated by <control>

### Risks retired by upgrading
- <CVE or feature> - Critical - retired on successful Wave N completion

## 9. CAB Record
Use this as the body of the change request. (Or use the `/change-request-writer` skill.)

- **Change:** Upgrade <vendor> fleet from <current> to <target>
- **Type:** Normal (if planned waves) / Standard (if vendor-categorised low-risk)
- **Risk:** <score> - driven by <main risk>
- **Backout:** Yes - dual image / staged rollback path
- **Windows:** See section 7
- **Implementer:** <team>
- **Approver:** Change Manager / CAB

## 10. Communications Plan
| Audience | Trigger | Channel | Owner |
|---|---|---|---|
| All staff | Before Wave 1 | Email | IT Ops |
| Site leads | Before their site's device | Email | IT Ops |
| Executive sponsor | After each wave completion | Short email | Upgrade Owner |
| Customers | Only if customer-facing impact expected | Status page | Comms Lead |

## 11. Success Criteria (programme-level)
- [ ] All devices on target version by YYYY-MM-DD
- [ ] Zero P1/P2 incidents attributable to upgrade
- [ ] <N> hours of total scheduled downtime, actual within tolerance
- [ ] All related CVEs retired from risk register
- [ ] Post-programme review scheduled

## 12. Appendix
- Vendor release notes URL
- Known issues list
- Per-device inventory spreadsheet (attach)
- Baseline state captures per device (attach)
```

## Example invocation

**User:** "/firmware-upgrade-planner - we need to upgrade 24 Cisco Catalyst 9200L switches from 17.9.5 to 17.12.4 because of a medium-severity CVE. Mix of access switches at HQ and branches."

**What the skill will do:**
1. Ask about HA posture (usually access switches aren't HA - so upgrade = downtime), maintenance windows available, lab device availability, rollback path (dual-image on Catalyst 9200L is standard), and business constraints.
2. Propose Wave 0 lab, Wave 1 three branch devices (staggered Mon-Wed-Fri evenings), Wave 2 HQ access switches (weekend), with 72-hour success gates between waves.
3. Produce the per-device procedure with exact IOS-XE commands for Catalyst 9200L.
4. Flag that the jump 17.9.5 -> 17.12.4 is direct per Cisco matrix, note any intermediate steps required by specific features.
5. Generate a CAB record that can be pasted into the change tool.

## Notes for the requester

- **Never upgrade without a lab test.** If you don't have a lab device, at least upgrade one non-critical branch device first and sit on it for 72 hours.
- **72-hour soak is a rule of thumb.** Some upgrades regress only under specific traffic patterns - 72 hours usually surfaces them. Adjust by risk tolerance.
- **Vendor release notes are mandatory reading.** "Known issues" section is where the landmines live. The skill will prompt you to confirm you've read it.
- **HA pairs upgrade one side, wait, then the other.** Never both in the same window.
- **Rollback is a one-way door on some platforms.** On older IOS and some Juniper releases, rolling back after certain config migrations isn't clean. Confirm with the vendor TAC before planning the wave.
- **"Good" looks like:** someone other than you can run the plan. Wave gates are explicit and measurable. The CAB reads the plan and approves without follow-up questions.