---
name: network-triage
description: Given a symptom ("VPN tunnel flapping", "random Teams drops", "SaaS slow from branch 3"), walk a layered diagnostic ladder and output a targeted investigation plan plus the data capture commands to run.
version: 1.0.0
author: VantagePoint Networks
audience: Network Engineers, NOC Analysts, Platform Engineers, SREs
output_format: Markdown — symptom summary, working hypothesis ladder, prioritised diagnostic commands, decision tree.
license: MIT
---

# Network Triage

A Claude Code skill for the 30 minutes between "it's broken" and "we know why". Converts a symptom description into a structured investigation plan so a junior engineer can follow it under pressure.

## How to use this skill

1. Download this `SKILL.md` file.
2. Place it in `~/.claude/commands/` (macOS/Linux) or `%USERPROFILE%\.claude\commands\` (Windows).
3. In Claude Code, run `/network-triage`. Describe the symptom, the environment, and when it started.

## When to use this

- A user-reported "the internet is slow" complaint where you don't know where to start.
- A recurring intermittent issue that nobody has investigated in a structured way.
- On-call runbook starter when a junior engineer is the lead investigator.
- Post-incident — retrospectively, to check whether the investigation path was sound.
- Training — show the layered thinking new engineers should adopt.

## What you'll get

A Markdown diagnostic plan with:

1. **Symptom normalisation** — what you said, what I understood.
2. **Working hypothesis ladder** — 3–5 hypotheses ranked by likelihood.
3. **Diagnostic plan** — ordered list of commands / checks to execute, with expected output for each.
4. **Decision tree** — branch on the first significant finding.
5. **Escalation path** — when to stop self-serve and call a vendor / another team.
6. **Data capture** — what to save for the post-incident review.

## Clarifying questions I will ask you

1. **Symptom in one sentence** — "VPN tunnel flaps every 15–30 min between HQ and branch-05".
2. **Scope** — one user / one site / one service / a whole region?
3. **When did it start?** — specific time, or "noticed today".
4. **What changed recently?** — deploys, firmware, weather, ISP maintenance.
5. **Who is affected?** — specific groups, any VIP/customer impact?
6. **What have you tried?** — so I don't re-suggest those.
7. **Platform mix** — vendors at play (e.g., Cisco ASA + Palo Alto + Meraki).

## How I think about network problems

I walk the layers, top-down or bottom-up depending on symptom:

1. **Layer 1 / physical** — link, optics, fibre, copper, cable path.
2. **Layer 2 / switching** — VLAN, trunk, STP, CAM table, CRC errors.
3. **Layer 3 / routing** — ARP, routing table, next-hop reachability, BGP/OSPF/EIGRP state.
4. **Layer 4 / transport** — TCP handshake, session tracking, NAT, asymmetric paths.
5. **Security controls** — firewall deny, IPS drop, ACL, WAF.
6. **Application** — DNS, TLS handshake, auth token expiry, API rate limits.
7. **Environment** — time sync, certificates, MTU, MSS clamping.
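
The "top-down or bottom-up" choice can be sketched as a tiny heuristic. This is an illustrative sketch only — the keyword lists and layer strings below are assumptions, not part of the skill's actual logic:

```python
# Sketch: pick a ladder direction from the symptom text, then emit the
# layer checks in that order. Keyword lists are illustrative assumptions.

LAYERS = [
    "L1 physical: link state, optics, cabling",
    "L2 switching: VLAN, trunk, STP, CRC counters",
    "L3 routing: ARP, routes, next-hop, IGP/BGP state",
    "L4 transport: TCP handshake, NAT, session state",
    "Security controls: firewall/IPS/ACL drops",
    "Application: DNS, TLS, auth, rate limits",
    "Environment: NTP, certs, MTU/MSS",
]

# Symptoms that smell like user-facing app trouble start top-down;
# hard faults (link down, flaps) start bottom-up.
TOP_DOWN_HINTS = {"slow", "timeout", "intermittent", "drops"}

def diagnostic_order(symptom: str) -> list[str]:
    words = set(symptom.lower().split())
    return LAYERS[::-1] if words & TOP_DOWN_HINTS else LAYERS

# "slow" triggers a top-down walk, starting at the environment/app end.
plan = diagnostic_order("SaaS slow from branch 3")
print(plan[0])
```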

Most "the internet is slow" issues are actually:
- **DNS** (roughly half of reported "slow" problems; check it before trusting anything else).
- **TLS handshake latency** (rising with cloud apps; look at chain + session resumption).
- **MTU / MSS** over IPsec / GRE (especially after a recent change).
- **Asymmetric routing** breaking stateful firewalls.
- **Upstream ISP** that's silently rate-limiting.

The hypothesis ladder reflects base rates.
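
Given the DNS base rate above, the cheapest first check is to time resolution in isolation. A minimal sketch (the hostname and the ~100 ms rule of thumb are illustrative assumptions):

```python
# Sketch: time name resolution to separate "DNS is slow" from "the app
# is slow". Threshold below is a rule of thumb, not a standard.
import socket
import time

def resolve_ms(host: str) -> float:
    """Wall-clock milliseconds for one getaddrinfo() call."""
    start = time.perf_counter()
    socket.getaddrinfo(host, 443)
    return (time.perf_counter() - start) * 1000.0

# Consistently over ~100 ms from a LAN client usually points at the
# resolver path rather than the application.
print(f"localhost resolved in {resolve_ms('localhost'):.1f} ms")
```

Run it a few times against the affected SaaS hostname and against a known-good name; a gap between the two is your first significant finding.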

## Example output

Input: *"VPN tunnel between HQ and branch-07 flaps every ~20 minutes. Started yesterday after firmware patch on HQ Palo Alto. 30 users at branch lose access each time. HQ PA is PAN-OS 11.1.3."*

```markdown
# Network triage — VPN flap HQ ↔ branch-07

## Symptom summary
IPsec tunnel instability, ~20 min period, correlated with HQ firewall firmware change to PAN-OS 11.1.3.

## Working hypotheses (by likelihood)
1. **Post-upgrade IPsec regression in PAN-OS 11.1.3** — check vendor bulletins, known issues, changelog.
2. **DPD (dead peer detection) interval mismatch** — maybe defaults changed in the new version.
3. **Rekey interval / phase-2 lifetime change** — the ~20-min period may match a phase-2 rekey interval; compare configured lifetimes on both peers.
4. **MTU / fragmentation** — less likely given it worked before but check.
5. **Branch-side issue coincidental with HQ upgrade** — confirm by checking branch logs, not just HQ.

## Diagnostic plan (in order)

### A. Scope + confirm
1. `show system info` on HQ PA — confirm 11.1.3 is live.
2. On HQ PA: `show log system severity equal high direction equal backward` — grab the last 24h.
3. On HQ PA: `show vpn ipsec-sa | match branch-07` — current tunnel state.
4. On branch-07 peer: equivalent VPN state command (depends on device).

### B. Check known issues
5. Search the PAN PSIRT + release notes for "11.1.3 IPsec" — flagged regressions?
6. Check customer portal for advisories issued since the patch date.

### C. DPD + rekey timers
7. On both sides: dump phase-1 + phase-2 timers. Compare.
8. Check if HQ upgrade introduced a default-value change (phase-1 lifetime, DPD interval).

### D. MTU + fragmentation
9. Capture on HQ IPsec peer: `tcpdump -i eth1 -vv -w /tmp/ipsec.pcap 'host <branch-07-peer-ip> and esp'` (options before the filter expression, or tcpdump rejects the command).
10. Run during a known flap, grab 2–3 minutes around an event.

### E. Branch side
11. Branch peer logs during a flap — does the branch side see the session torn?
12. Branch WAN quality metrics (loss, jitter, latency) — rule out ISP.

## Decision tree

- **If vendor bulletin confirms regression** → open ticket with PAN support, downgrade or apply hotfix; add known-issue workaround to VPN config.
- **If DPD mismatch** → align DPD interval/retries on both sides.
- **If timers changed silently** → set explicit timers in config; don't rely on defaults.
- **If MTU fragmentation** → lower MSS clamp on the tunnel interface.
- **If branch-side fault** → flip investigation to branch ISP / hardware.

## Escalation path
- After step 5 without root cause: open PAN TAC ticket with HQ + branch configs + traces attached.
- After 2 hours without resolution and branch-07's 30 users still dropping: declare SEV2; invoke `/outage-comms-writer` for user communications.

## Data to save for the post-incident
- HQ PA `show system info`, `show vpn ipsec-sa` exports.
- Branch-07 peer equivalent.
- `/tmp/ipsec.pcap` from step 9.
- Full firewall system log severity >= info for the 24h around the upgrade.
- PAN-OS 11.1.3 release notes snapshot.
- Your change ticket for the upgrade.
```
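
If the MTU/fragmentation branch of the decision tree wins, the clamp value can be estimated from the tunnel overhead. The overhead figures below are rough ballpark values for tunnel-mode ESP (an assumption — exact overhead depends on cipher suite and padding), not vendor-exact numbers:

```python
# Sketch: estimate a safe TCP MSS for traffic inside an IPsec tunnel.
# Overhead figures are rough ballpark assumptions, not exact per cipher.

def clamped_mss(wan_mtu: int = 1500,
                esp_overhead: int = 73,   # outer IP + ESP hdr/IV/pad/trailer, upper-ish bound
                inner_ip: int = 20,
                tcp: int = 20) -> int:
    return wan_mtu - esp_overhead - inner_ip - tcp

print(clamped_mss())  # 1387; many shops round down further to 1350
```

The point of doing the arithmetic rather than copying a magic number is that it survives non-1500 WAN MTUs (PPPoE, GRE-over-IPsec) where the magic number does not.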

## What I won't do

- I won't pretend certainty about root cause from symptom alone; I'll rank hypotheses and let the data confirm.
- I won't suggest destructive commands (reboot, factory reset) as a first step.
- I won't bypass change control for permanent fixes; I'll mark interim workarounds separately.
- I won't leave you without a bail-out — every path has an escalation line.

## Reference

- Layer-by-layer troubleshooting heritage (Cisco TAC methodology, 1990s onward).
- [Google SRE Workbook](https://sre.google/workbook/) — structured triage principles.
- Vendor PSIRT + release-notes habit (most "new bug" incidents start at recently-changed code).

## Attribution

Built by **Hak** at **VantagePoint Networks**. MIT licensed.
