---
name: bgp-troubleshooter
description: Paste BGP neighbor symptoms / output and receive a ranked diagnosis with specific remediation commands for Cisco, Juniper, FRR, Arista, and BIRD. Covers peer-down, flapping, missing routes, route leaks, and AS-path weirdness.
version: 1.0.0
author: VantagePoint Networks
audience: Network Engineers, SRE, NOC Analysts, Peering Coordinators
output_format: Markdown — diagnosis ranked by likelihood, vendor-specific fix commands, verification steps.
license: MIT
---

# BGP Troubleshooter

A Claude Code skill focused on BGP pain. Consumes neighbor output, `show ip bgp` dumps, log snippets, or a narrative description — returns ranked hypotheses with concrete fix commands.

## How to use this skill

1. Download this `SKILL.md` file.
2. Place it in `~/.claude/commands/` (macOS/Linux) or `%USERPROFILE%\.claude\commands\` (Windows).
3. In Claude Code, run `/bgp-troubleshooter`. Paste the symptom description + available CLI output.

## When to use this

- A BGP session went down and isn't coming back.
- A session is established but missing routes.
- The wrong path is being selected.
- Route flapping is causing intermittent outages.
- You're setting up a new peer and the session won't come up.
- You're seeing AS-path weirdness suggesting a leak or hijack.

## What you'll get

A Markdown diagnostic with:

1. **Symptom normalisation.**
2. **Top 3–5 hypotheses ranked by likelihood** — with reasoning.
3. **Verification commands per hypothesis** — in your platform's syntax.
4. **Fix commands** — once the cause is confirmed.
5. **Post-fix verification** — what success looks like.
6. **Prevention** — config hardening to stop this recurring.

## Clarifying questions I will ask you

1. **Platform(s)** — Cisco IOS/XR/NX-OS, Juniper Junos, Arista EOS, FRR, BIRD, MikroTik.
2. **Session type** — iBGP / eBGP, single-hop / multi-hop, IPv4 / IPv6 / VPNv4 / EVPN.
3. **Symptom** — not coming up, flapping, established-but-no-routes, wrong path, route leak/hijack.
4. **CLI output** — paste `show bgp summary / neighbors / ipv4 unicast` or equivalent.
5. **Log snippet** — any state-transition messages from the affected neighbor.
6. **Recent changes** — config, firmware, peer's side.

## Hypothesis catalogue

I match symptoms against common BGP failure modes:

### Session won't come up (Idle / Active / OpenSent)
- **TCP 179 blocked** — by an ACL, upstream firewall, or local control-plane policing.
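- **Source address mismatch** — the address you source the session from isn't the one in the peer's neighbor statement (check `update-source` on Cisco, `local-address` on Juniper).
- **Multi-hop missing** — eBGP packets default to TTL 1, so peers more than one hop apart need `ebgp-multihop` on Cisco, `multihop` on Juniper.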
- **AS number mismatch** — you configured the wrong remote-as.
- **Password mismatch** — MD5 peering password differs.
- **TTL security mismatch** — `ttl-security` on one side only.
- **Route to peer missing** — you can't ping the peer IP.
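
To separate "TCP 179 blocked" from "route to peer missing", a quick connect test helps. The sketch below is a minimal illustration, assuming you can run Python on a host (or on-box shell) that shares the path to the peer; `probe_bgp_port` is a hypothetical helper name, not part of any vendor tooling.

```python
import socket

def probe_bgp_port(peer, port=179, source=None, timeout=3.0):
    """Classify a TCP connect attempt to the peer's BGP port.

    Returns "open", "refused", "timeout", or "unreachable".
    """
    # Binding to the session's source address mimics update-source / local-address.
    src = (source, 0) if source else None
    try:
        with socket.create_connection((peer, port), timeout=timeout,
                                      source_address=src):
            return "open"
    except ConnectionRefusedError:
        return "refused"        # reachable, but the peer reset the connection
    except socket.timeout:
        return "timeout"        # typically filtered, or no return route
    except OSError:
        return "unreachable"    # e.g. no local route to the peer

# Example call (203.0.113.5 is a documentation address):
# print(probe_bgp_port("203.0.113.5", source=None))
```

As a rough guide, "timeout" tends to point at a filter or a missing return route, while "refused" suggests the peer is reachable but is not accepting a session from that source address.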

### Session up but no routes
- **Missing network statements / redistribution** — nothing being originated.
- **Inbound filter dropping everything** — prefix-list / route-map / filter too strict.
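- **Policy changed without a soft reset** — the new inbound policy isn't applied to routes already received until you trigger a route refresh (soft clear) or have `soft-reconfiguration inbound` configured.
- **Maximum-prefix hit** — the peer exceeded the configured limit; on Cisco IOS the session is torn down and `show bgp summary` shows `Idle (PfxCt)` unless the limit is warning-only.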
- **Next-hop unreachable** — iBGP loopback next-hop but no route to loopback.
- **AFI/SAFI not negotiated** — IPv6 or VPNv4 capability not enabled.
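- **Next-hop-self missing** on the border router — eBGP-learned routes enter iBGP with the external next hop unchanged, and a route reflector does not rewrite it on reflected routes by default.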
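
A first pass over pasted summary output can be mechanical. The sketch below assumes IOS-style `show bgp summary` rows with the state or prefix count in the last column; the sample text is illustrative, not captured from a real device, and `classify_neighbors` is a hypothetical helper.

```python
import ipaddress

def classify_neighbors(summary_text):
    """Map each neighbor address to a short finding."""
    findings = {}
    for line in summary_text.splitlines():
        cols = line.split()
        if len(cols) < 10:
            continue
        try:
            ipaddress.ip_address(cols[0])   # skips header and banner lines
        except ValueError:
            continue
        state = " ".join(cols[9:])          # "Idle (PfxCt)" spans two tokens
        if state.isdigit():
            findings[cols[0]] = f"established, {int(state)} prefixes received"
        elif "PfxCt" in state:
            findings[cols[0]] = "down: maximum-prefix limit exceeded"
        else:
            findings[cols[0]] = f"not established ({state})"
    return findings

# Illustrative sample only.
SAMPLE = """\
Neighbor        V    AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down  State/PfxRcd
203.0.113.5     4 65001     120     118     42   0    0 01:02:03        0
198.51.100.9    4 65002      10      12      0   0    0 00:05:00 Idle (PfxCt)
192.0.2.1       4 65003     500     480     42   0    0 3d04h         812
"""

for neighbor, finding in classify_neighbors(SAMPLE).items():
    print(neighbor, "->", finding)
```

An established neighbor with zero prefixes points at origination, filtering, or AFI/SAFI negotiation; `Idle (PfxCt)` points at the maximum-prefix limit.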

### Flapping
- **Keepalive / hold-time expiring** — underlying link flap or BFD not configured where it should be.
- **CPU / memory exhausted** — control-plane starved.
- **MTU mismatch** — large BGP UPDATEs exceed the path MTU and are silently dropped when PMTUD is broken, so the session establishes and then times out.
- **Route churn** hitting dampening.
- **ISP / transit flap** — problem upstream of your peer.
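
Hold-time flaps are easier to reason about once the negotiated timers are known. The sketch below follows RFC 4271: each side uses the smaller of the two advertised hold times, and one third of that is the suggested keepalive interval. Implementations may send keepalives more often, so treat the second value as an upper bound rather than what a given platform will do.

```python
def negotiated_timers(local_hold, peer_hold):
    """Return (hold_time, keepalive) in seconds for a BGP session."""
    for value in (local_hold, peer_hold):
        # RFC 4271: hold time must be zero or at least three seconds.
        if value != 0 and value < 3:
            raise ValueError("hold time must be 0 or at least 3 seconds")
    hold = min(local_hold, peer_hold)   # zero on either side disables keepalives
    keepalive = hold // 3               # suggested maximum keepalive interval
    return hold, keepalive

print(negotiated_timers(180, 90))   # a 180 s speaker peering with a 90 s speaker
```

A session that resets at a fixed interval equal to the negotiated hold time usually means nothing at all is arriving from the peer after the session is established.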

### Wrong path selection
- **Local-preference / MED mis-set.**
- **AS-path prepending ineffective.**
- **Weight set by accident overrides everything.**
- **Community not matched** by your ingress policy.
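
The ordering behind these hypotheses can be sketched as a pairwise comparison. This is a deliberately simplified model covering only weight (Cisco-specific), local preference, AS-path length, origin, and MED; it omits locally originated routes, eBGP-over-iBGP preference, IGP metric, and router-ID tie-breakers. The `Path` class and field defaults are assumptions for illustration.

```python
from dataclasses import dataclass
from functools import reduce

@dataclass
class Path:
    name: str
    weight: int = 0
    local_pref: int = 100
    as_path: tuple = ()
    origin: int = 0      # 0 = IGP, 1 = EGP, 2 = incomplete
    med: int = 0

def better(a, b):
    """Return the preferred of two paths using the first five decision steps."""
    if a.weight != b.weight:
        return a if a.weight > b.weight else b
    if a.local_pref != b.local_pref:
        return a if a.local_pref > b.local_pref else b
    if len(a.as_path) != len(b.as_path):
        return a if len(a.as_path) < len(b.as_path) else b
    if a.origin != b.origin:
        return a if a.origin < b.origin else b
    # MED is compared only between paths from the same neighboring AS.
    if a.as_path[:1] == b.as_path[:1] and a.med != b.med:
        return a if a.med < b.med else b
    return a  # tie: later steps are not modelled here

def best_path(paths):
    return reduce(better, paths)

short = Path("via-A", as_path=(65001, 65010))
prepended = Path("via-B", local_pref=200, as_path=(65002, 65002, 65002, 65010))
print(best_path([short, prepended]).name)   # local-pref is checked before AS-path length
```

Because local preference is evaluated before AS-path length, prepending has no effect on a network that sets a higher local preference for the prepended path. An accidental weight wins over both.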

### Route leak / hijack suspicion
- **Customer announcing upstream prefixes back** (common leak pattern).
- **Unexpected AS path** — new originating AS for your prefix.
- **RPKI invalid** — your policy drops invalids, peer's announcement doesn't match ROA.
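
The RPKI check can be reproduced offline when you have the relevant ROAs to hand. The sketch below follows the origin-validation logic of RFC 6811: a route is valid if any covering ROA matches the origin AS and the prefix length is within `maxLength`, invalid if it is covered but nothing matches, and not-found otherwise. The ROA tuples use documentation prefixes and ASNs and are purely illustrative.

```python
import ipaddress

def rov_state(prefix, origin_as, roas):
    """Return 'valid', 'invalid', or 'not-found' for a route.

    roas is an iterable of (roa_prefix, max_length, asn) tuples.
    """
    route = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_length, asn in roas:
        roa_net = ipaddress.ip_network(roa_prefix)
        # A ROA covers the route if the route is equal to or inside the ROA prefix.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        if asn == origin_as and route.prefixlen <= max_length:
            return "valid"
    return "invalid" if covered else "not-found"

ROAS = [("192.0.2.0/24", 24, 64500)]          # hypothetical ROA set
print(rov_state("192.0.2.0/25", 64500, ROAS))  # more specific than maxLength allows
```

A more-specific announcement from the correct AS is still invalid when it exceeds `maxLength`, which is a common cause of "my own prefix is being dropped".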

## Example output

Input: *"Cisco ASR 1000 (IOS XE), eBGP to new transit provider 64512 → 65001. Session flaps between Established and Idle every 90 seconds. BGP summary shows 'up 00:01:30' then resets."*

```markdown
# BGP triage — eBGP to 65001 flapping

## Hypotheses (by likelihood)

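1. **Hold-time expiring** — the session uses the lower of the two peers' hold times. Cisco defaults are 180s hold / 60s keepalive, so a reset at exactly 90s points to a negotiated 90s hold time (the Junos default, for example) expiring because nothing arrives from the peer after the session reaches Established. Check MTU / BFD / CPU first.
2. **MTU black hole** — if UPDATEs exceed the path MTU over a tunnel/ISP path and neither PMTUD nor TCP MSS clamping is working, large UPDATEs are dropped silently.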
3. **BFD failing** — if BFD is configured, micro-loss on the path triggers BFD down which takes BGP down.
4. **Control-plane policing (CoPP)** on the ASR is rate-limiting BGP packets.
5. **Peer's side** — ISP maintenance, asymmetric routing at their end.

## Verification

### Hypothesis 1 — Keepalive / hold-time
\`\`\`
show bgp ipv4 unicast neighbors 203.0.113.5 | i Hold|Keepalive|Last
\`\`\`
Check the negotiated hold time and keepalive interval. The session uses the lower of the two sides' hold times, so if the peer is aggressive (e.g., hold 9s), even brief packet loss causes flaps.

### Hypothesis 2 — MTU
\`\`\`
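ping 203.0.113.5 size 1500 df-bit
! only if your link supports jumbo frames:
ping 203.0.113.5 size 9000 df-bit
\`\`\`
On IOS/IOS XE, `size` is the full IP datagram length, so `size 1500` tests a 1500-byte path MTU directly (on Linux, `ping -s` counts only the ICMP payload, so the equivalent is 1472). If 1500 fails, step the size down to find where the path breaks.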

### Hypothesis 3 — BFD
\`\`\`
show bfd neighbors
show run | s bfd
\`\`\`

### Hypothesis 4 — CoPP / control-plane
\`\`\`
show policy-map control-plane in
show control-plane features
show platform hardware qfp active infrastructure punt statistics summary
\`\`\`

### Hypothesis 5 — Peer-side
- Ask the ISP: any flaps at their end, any maintenance, state transitions in their logs?

## Fix commands (once root cause known)

### If MTU black hole:
\`\`\`
interface GigabitEthernet0/0/0
  ip tcp adjust-mss 1360
\`\`\`
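Alternatively, make PMTUD work end to end: stop firewalls dropping ICMP Type 3 Code 4 (Fragmentation Needed). Note that `ip tcp adjust-mss` rewrites the MSS on TCP SYNs transiting the interface; for the router's own BGP session, also check the per-neighbor `transport path-mtu-discovery` setting.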

### If CoPP:
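\`\`\`
ip access-list extended BGP-ACL
 permit tcp host 203.0.113.5 any eq bgp
 permit tcp host 203.0.113.5 eq bgp any
class-map match-any BGP
 match access-group name BGP-ACL
policy-map CONTROL-PLANE-POLICY
 class BGP
  police 8000 conform-action transmit exceed-action drop
control-plane
 service-policy input CONTROL-PLANE-POLICY
\`\`\`
Permit the BGP peer's source IP explicitly (the `BGP-ACL` entries) rather than relying on the default policer. The 8000 bps rate is a placeholder: size it from the peer's observed update volume so legitimate BGP traffic is never dropped.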

### If BFD triggering:
Relax BFD intervals (e.g., 300ms x 3 → 500ms x 5) or disable BFD if link is known lossy.

## Verification after fix
\`\`\`
show bgp ipv4 unicast neighbors 203.0.113.5 | i Uptime|BGP state
show bgp ipv4 unicast summary | i 203.0.113.5
\`\`\`
Expect `Established` with uptime monotonically increasing for > 30 min. No flap log entries.

## Prevention
- Always set explicit BGP timers on production neighbors.
- Enable BFD only when both sides support aggressive timers.
- TCP MSS clamp on eBGP interfaces where the path MTU is below 1500 (tunnels, PPPoE).
- CoPP that explicitly permits peer IPs.
- Syslog the neighbor state transitions; alert on any non-expected transition.
```
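
The MTU numbers in the example (an MSS clamp of 1360, a 1472-byte ping payload) come from simple header arithmetic. The sketch below assumes IPv4 with no IP or TCP options; TCP options such as MD5 authentication reduce the usable payload further.

```python
IPV4_HEADER = 20   # bytes, no options
TCP_HEADER = 20    # bytes, no options
ICMP_HEADER = 8    # bytes

def tcp_mss(path_mtu):
    """Largest TCP payload that fits in one unfragmented IPv4 packet."""
    return path_mtu - IPV4_HEADER - TCP_HEADER

def icmp_payload(path_mtu):
    """ICMP echo payload size that exactly fills the path MTU."""
    return path_mtu - IPV4_HEADER - ICMP_HEADER

for mtu in (1500, 1400):
    print(mtu, tcp_mss(mtu), icmp_payload(mtu))
```

Read the other way round, a clamp of 1360 protects a path whose MTU is as low as 1400 bytes, which leaves headroom for typical tunnel encapsulation on a 1500-byte link.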

## What I won't do

- I won't recommend `clear ip bgp *` as a first action. It resets every BGP session on the router, may mask the flap only temporarily, and destroys the evidence needed to find the cause.
- I won't suggest disabling route validation (RPKI, bogon filters) to "make it work".
- I won't claim a diagnosis without the supporting output or log I asked for.

## Reference

- RFC 4271 — BGP-4
- RFC 7454 — BGP Operations and Security
- MANRS (Mutually Agreed Norms for Routing Security) practices
- Cisco BGP troubleshooting guide, Juniper Day One: Deploying BGP

## Attribution

Built by **Hak** at **VantagePoint Networks**. MIT licensed.
