# Certificate & TLS Lifecycle Runbook

> Working runbook for managing the TLS certificate lifecycle across an estate: inventory, issuance, renewal, rotation, revocation, monitoring, and emergency replacement. Production-tested. MIT licensed.

**Assumption:** you run a mix of public (ACME / Let's Encrypt), private CA-issued, and device-level certificates across cloud + on-prem. If you are a pure "everything is Let's Encrypt" shop this is over-engineered; scale back sections.

---

## 0. Why certificates bite you

Certificate expiry is one of the most common recurring root causes in public outage post-mortems. The pattern is always the same:

1. Someone manually renewed a cert three years ago.
2. The renewal calendar invite dies with the person.
3. Nobody reviews the cert inventory because there is no cert inventory.
4. 01:30 on a Saturday, monitoring goes red.

This runbook stops that loop. The one-line principle: **every certificate has a named owner, an automated renewal path, and a monitored expiry**. No exceptions.

---

## 1. Inventory

### 1.1 What goes in the inventory

| Field | Example |
|---|---|
| Common name / SANs | `api.example.com, *.api.example.com` |
| Issuer | Let's Encrypt R3 |
| Key algorithm + size | RSA 2048 / ECDSA P-256 |
| Serial | `03:a1:2b:...` |
| Not-before | 2026-02-10 |
| Not-after | 2026-05-10 |
| Deployment locations | `lb-edge-01`, `lb-edge-02`, `cdn-origin-01` |
| Automation | `cert-manager/production-issuer` / `acme.sh` / manual |
| Owner (team) | Platform |
| Owner (individual) | `@oncall-platform` |
| Linked ticket/project | `NET-1234` |
| Tier | Production / staging / internal / IoT |
| Notes | e.g. "Pinned by mobile app up to version 4.2" |
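
One way to keep these fields consistent across tools is to model a row as a typed record. The sketch below is illustrative only: the class name `CertRecord`, its field names, and the `days_remaining` / `is_manual` helpers are assumptions, not part of any existing tool in this library.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CertRecord:
    """One row of the certificate inventory (hypothetical field names)."""
    common_name: str
    sans: list[str]
    issuer: str
    key_algorithm: str            # e.g. "ECDSA P-256" or "RSA 2048"
    serial: str
    not_before: date
    not_after: date
    deployment_locations: list[str]
    automation: str               # e.g. "cert-manager/production-issuer", "acme.sh", "manual"
    owner_team: str
    owner_individual: str
    ticket: str = ""
    tier: str = "production"
    notes: str = ""

    def days_remaining(self, today: date) -> int:
        """Days until expiry; zero or negative once the cert has expired."""
        return (self.not_after - today).days

    def is_manual(self) -> bool:
        """True when the cert has no automated renewal path."""
        return self.automation.strip().lower() == "manual"


# Example row built from the table above (serial left elided as in the table).
example = CertRecord(
    common_name="api.example.com",
    sans=["api.example.com", "*.api.example.com"],
    issuer="Let's Encrypt R3",
    key_algorithm="ECDSA P-256",
    serial="03:a1:2b:...",
    not_before=date(2026, 2, 10),
    not_after=date(2026, 5, 10),
    deployment_locations=["lb-edge-01", "lb-edge-02", "cdn-origin-01"],
    automation="cert-manager/production-issuer",
    owner_team="Platform",
    owner_individual="@oncall-platform",
    ticket="NET-1234",
)
```

Making owner and automation required fields means a row cannot be created without them, which matches the one-line principle in section 0.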

### 1.2 How to populate it initially

- **Public-facing certs:** scrape Certificate Transparency logs for your domain (crt.sh, censys.io). Every SAN you've ever issued is searchable.
- **Private CA certs:** export from the issuing CA. Don't trust anyone's memory.
- **Load balancer / device certs:** walk each LB, firewall, VPN, mail gateway, print server, IoT hub. Use a scanner (see `/cert-expiry-audit` skill in this library).
- **Client certs:** harder. Start with the identities enrolled in your IdP / MDM.
- **Internal application certs:** pull from Kubernetes secrets, HashiCorp Vault, AWS ACM, Azure Key Vault, GCP Certificate Manager.

If you find more than 20% of certs are manually managed, building the inventory and automating those certs *is* the first runbook item.
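
For the Certificate Transparency step, a minimal sketch of turning a saved CT export into a list of names to chase. It assumes a JSON array where each entry carries a `name_value` field (newline-separated DNS names) and an ISO-8601 `not_after` field, which is how crt.sh's JSON output is commonly shaped; verify against your own export before relying on it. The function name `names_from_ct` is hypothetical.

```python
import json


def names_from_ct(ct_json: str) -> dict[str, str]:
    """Map each DNS name found in a CT export to the latest not_after seen for it."""
    latest: dict[str, str] = {}
    for entry in json.loads(ct_json):
        not_after = entry["not_after"]
        # Assumption: name_value may hold several names separated by newlines.
        for name in entry["name_value"].splitlines():
            name = name.strip().lower()
            if not name:
                continue
            # ISO-8601 timestamps in the same format sort correctly as strings.
            if name not in latest or not_after > latest[name]:
                latest[name] = not_after
    return latest
```

Every name in the result that is missing from the inventory is a cert somebody issued and nobody recorded.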

---

## 2. Issuance policy

### 2.1 Public-facing

- **ACME (Let's Encrypt / ZeroSSL) by default** for anything internet-facing. Free, 90-day rotation forces automation.
- **Wildcard or SAN?** SAN is safer. A leaked wildcard key lets an attacker impersonate every subdomain, and revoking it means redeploying everywhere the wildcard is installed.
- **Key type:** ECDSA P-256 for new issuance. RSA 2048 only where legacy clients require it. Never RSA 1024.
- **Extended key usage (EKU):** serverAuth only. Don't issue clientAuth + serverAuth from the same cert.
- **EV certs:** don't bother. Browsers stopped showing the green bar in 2019.

### 2.2 Private / internal

- **Use a private CA** (AWS Private CA, HashiCorp Vault PKI, Smallstep, step-ca, Azure Private CA). Don't self-sign in production.
- **Short lifetimes** (7–90 days) with automated renewal. Yearly issuance is a smell.
- **Intermediate hierarchy:** root kept offline in HSM; issuing intermediates online.
- **CRL + OCSP published** at stable, monitored endpoints.

### 2.3 Client certs (mutual TLS)

- Separate issuing CA from server-auth CA.
- Tie to identity (user, device, workload). Workload identity federation where cloud-native.
- Short lifetime (hours for workloads, days for user devices).

### 2.4 Approved issuers

Document which CAs are approved for your domain. Add a CAA record:

```
example.com. CAA 0 issue "letsencrypt.org"
example.com. CAA 0 issue "amazon.com"
example.com. CAA 0 issuewild "letsencrypt.org"
example.com. CAA 0 iodef "mailto:security@example.com"
```

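CAA doesn't stop MITM, and a compromised or non-compliant CA can simply ignore it. What it does stop is mis-issuance by compliant CAs you have not authorized for your domain.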
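
To sanity-check a CAA set before publishing it, the evaluation rule can be sketched in a few lines. This is a simplified reading of RFC 8659 (no tree-climbing to parent domains, no handling of critical flags); the helper names `parse_caa` and `issuer_allowed` are hypothetical.

```python
def parse_caa(zone_lines):
    """Parse zone-file style 'name CAA flags tag "value"' lines into (tag, value) pairs."""
    records = []
    for line in zone_lines:
        parts = line.split()
        if "CAA" not in parts:
            continue
        i = parts.index("CAA")
        tag = parts[i + 2].lower()
        value = " ".join(parts[i + 3:]).strip('"')
        records.append((tag, value))
    return records


def issuer_allowed(records, ca_domain, wildcard=False):
    """Return True if a compliant CA identified by ca_domain may issue."""
    # Drop any parameters after ';' (a bare ';' value means "no CA may issue").
    issue = [v.split(";")[0].strip() for t, v in records if t == "issue"]
    issuewild = [v.split(";")[0].strip() for t, v in records if t == "issuewild"]
    # For wildcard names, issuewild takes precedence over issue when present.
    relevant = issuewild if (wildcard and issuewild) else issue
    if not relevant:
        return True  # no restricting property means CAA does not restrict issuance
    return ca_domain in relevant


records = parse_caa([
    'example.com. CAA 0 issue "letsencrypt.org"',
    'example.com. CAA 0 issue "amazon.com"',
    'example.com. CAA 0 issuewild "letsencrypt.org"',
    'example.com. CAA 0 iodef "mailto:security@example.com"',
])
```

With the record set above, `amazon.com` may issue ordinary certs but not wildcards, because the single `issuewild` entry restricts wildcards to `letsencrypt.org`.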

---

## 3. Renewal

### 3.1 Automated renewal priority

| Platform | Mechanism |
|---|---|
| Kubernetes workloads | cert-manager + Issuer / ClusterIssuer |
| AWS workloads | ACM auto-renewal (if ACM-issued) |
| Azure workloads | Key Vault + App Service managed certs OR Certificate Manager |
| GCP workloads | Certificate Manager / managed certs |
| Load balancers (F5, Citrix) | Venafi / Keyfactor / ACM-integrations |
| Firewalls | Vendor-specific; some support ACME directly (Fortinet 7.4+, Palo Alto 11+) |
| Windows servers | Win-ACME or Certify the Web |
| Legacy devices | Wrap with a renewable front-end proxy where possible |

### 3.2 Renewal windows

- **Non-production:** renew at 30 days before expiry.
- **Production:** renew at 45 days before expiry.
- **Pinned mobile / IoT:** renew at 90+ days and coordinate with a client rollout.

Never renew at the last possible moment. The first attempt in each window should succeed; if it fails, the window must leave enough time for the renewal job to retry and escalate to a human before expiry.
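
The windows above reduce to simple date arithmetic. A minimal sketch, assuming tier labels `non-production`, `production`, and `pinned` (hypothetical names chosen for this example):

```python
from datetime import date, timedelta

# Days before expiry at which renewal should start, per the list above.
RENEW_DAYS_BEFORE_EXPIRY = {
    "non-production": 30,
    "production": 45,
    "pinned": 90,  # pinned mobile / IoT: 90+ days, plus a coordinated client rollout
}


def renewal_date(not_after: date, tier: str) -> date:
    """First day on which renewal should be attempted."""
    return not_after - timedelta(days=RENEW_DAYS_BEFORE_EXPIRY[tier])


def is_due(not_after: date, tier: str, today: date) -> bool:
    """True once today has reached the renewal window for this cert."""
    return today >= renewal_date(not_after, tier)
```

For a production cert expiring on 2026-05-10, `renewal_date` gives 2026-03-26. Note that these fixed offsets only make sense for certs whose lifetime is comfortably longer than the window; very short-lived certs need a lifetime-relative rule instead.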

### 3.3 Manual renewal SLA

If a cert *must* be manually renewed:

- Ticket raised 60 days before expiry.
- Named engineer assigned.
- Change ticket links to CA, deployment location, rollback plan.
- Second engineer verifies post-deployment.
- Calendar reminder set for next renewal (in the inventory, not in someone's personal calendar).

---

## 4. Rotation + deployment

### 4.1 Atomic rotation

Bad: the new cert is written to disk but the service is never reloaded. The process keeps serving the old cert from memory, and in a fleet where only some nodes have picked up the new file, handshakes split between old and new certs for hours.

Good:
- Write new cert + key into place.
- Trigger service reload (SIGHUP for nginx, `systemctl reload` for most).
- Verify the served cert after the reload via `openssl s_client -servername foo -connect host:443 </dev/null 2>/dev/null | openssl x509 -noout -dates`. Running this before the reload only shows the old cert.
- Monitor TLS handshake success rate for 15 minutes.
- Confirm old cert no longer in use (sample connection fingerprints if the LB supports it).
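
The "write into place" step is where partial files sneak in: a service that reloads mid-write can read a truncated PEM. A common way to avoid that is write-to-temp-then-rename, sketched here with the standard library only. The function name `atomic_write` is hypothetical, and the sketch assumes the temp file and the target live on the same filesystem (required for the rename to be atomic).

```python
import os
import tempfile


def atomic_write(path: str, data: bytes, mode: int = 0o600) -> None:
    """Replace path so readers see either the old file or the new one, never a partial write."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-cert-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes are on disk before the rename
        os.chmod(tmp_path, mode)
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

This only makes each file swap atomic. The cert and key are still two files, so the reload should happen after both have been replaced.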

### 4.2 Multi-region / load-balanced fleets

- Stage the cert in the non-primary region first.
- Move 10% of traffic via DNS or LB weighting.
- Watch for 15 minutes.
- Promote to full traffic.
- Apply the cert to the remaining fleet.

### 4.3 Pinning (public key / certificate)

- **HPKP** is dead. Don't use it.
- **Mobile app pinning** is common and dangerous. Document which apps pin what, and at which key level (leaf vs intermediate vs root).
- **Rotation plan** for pinned certs: new pin shipped in a backward-compatible app release, then the cert rotates, then the old pin is removed in the next release. Minimum 90-day overlap.

---

## 5. Revocation

### 5.1 When to revoke

- Private key compromise or suspected compromise.
- Certificate mis-issuance (to wrong name, wrong org).
- Service no longer in operation and cert will not be re-used.
- Forced by CA / policy / regulation.

### 5.2 How to revoke

- **ACME certs:** `certbot revoke --cert-path fullchain.pem`
- **Private CA:** use the CA's revoke command; publish updated CRL; ensure OCSP responder knows.
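- **ACM:** deleting a certificate from ACM does not revoke it. For certs issued by AWS Private CA, run `aws acm-pca revoke-certificate`; for ACM public certs, follow AWS's current revocation process (check the ACM documentation, since support differs by certificate type), then delete the cert.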
- **Broadly distributed CAs:** also notify the CA via their abuse contact if compromise is suspected at the CA level.

### 5.3 After revoke

- Rotate key material everywhere the cert was deployed. Revocation without replacing the key material is half a job.
- Audit logs to find any leak path.
- Post-mortem if revocation was triggered by compromise.

---

## 6. Monitoring

### 6.1 What to monitor

- **Expiry** — 90, 30, 7, 1 day alerts.
- **Chain validity** — intermediate still trusted, signer still valid.
- **Key strength** — alert on any RSA < 2048 or ECDSA < P-256.
- **Cipher suite** — weak suites detected (NULL, EXPORT, RC4, 3DES).
- **Protocol version** — TLS < 1.2 used anywhere.
- **Handshake success rate** — drop indicates something broken.
- **Certificate Transparency watch** — alert on unexpected issuance for your domain.
- **CAA drift** — alert if CAA record changes or is removed.
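
The expiry and key-strength rules above are easy to encode as pure functions that any scanner can call. A minimal sketch; the function names and the choice to flag unknown algorithms are assumptions for illustration.

```python
ALERT_THRESHOLDS_DAYS = (90, 30, 7, 1)


def expiry_alert(days_remaining: int):
    """Return the tightest threshold crossed, 0 if already expired, None if no alert is due."""
    if days_remaining <= 0:
        return 0
    crossed = [t for t in ALERT_THRESHOLDS_DAYS if days_remaining <= t]
    return min(crossed) if crossed else None


def key_too_weak(algorithm: str, bits: int) -> bool:
    """Apply the 'RSA < 2048 or ECDSA < P-256' rule from the list above."""
    algorithm = algorithm.upper()
    if algorithm == "RSA":
        return bits < 2048
    if algorithm in ("ECDSA", "EC"):
        return bits < 256
    return True  # assumption: unknown algorithms are flagged for human review
```

A cert with 45 days left maps to the 90-day alert, and one with 5 days left maps to the 7-day alert. Routing each threshold to a progressively louder channel is left to your alerting stack.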

### 6.2 Tooling

- **SSL Labs API** for external grade tracking.
- **`testssl.sh`** for scheduled internal scans.
- **`openssl s_client`** in simple shell scripts for edge cases.
- **Prometheus blackbox exporter** with a TLS-enabled `tcp` or `http` probe module for chain + expiry (the `probe_ssl_earliest_cert_expiry` metric).
- **Grafana dashboard** with cert expiry panel sourced from the inventory.
- **crt.sh / Facebook CT monitor** for CT log subscriptions.
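
Where none of these tools reach (an odd port, an internal-only host), a small standard-library probe can fill the gap. This is a sketch, not a replacement for the tools above: it validates against the local OS trust store, so a host using a private CA needs that CA loaded into the context, and an already-expired or untrusted cert surfaces as an exception rather than a number. The function names are hypothetical.

```python
import socket
import ssl
from datetime import datetime, timezone


def not_after_to_datetime(not_after: str) -> datetime:
    """Convert the 'notAfter' string from getpeercert() into an aware UTC datetime."""
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)


def days_until_expiry(host: str, port: int = 443, timeout: float = 5.0) -> int:
    """Handshake with host:port and return whole days until the leaf cert expires.

    Raises ssl.SSLCertVerificationError if the chain is untrusted or already expired,
    which should itself be treated as an alert.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = not_after_to_datetime(cert["notAfter"])
    return (expires - datetime.now(timezone.utc)).days
```

Usage would look like `days_until_expiry("api.example.com")`, run on a schedule with the result fed into whatever alerting you already have.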

---

## 7. Emergency replacement (fire)

Certificate expired in production. Runbook:

### 7.1 Immediate (first 15 min)

1. **Confirm** — is it really expired or is it a client clock issue / chain problem?
2. **Declare** SEV1 if customer-facing. Open incident channel.
3. **Identify owner** from inventory.
4. **Issue new cert** via the fastest available path:
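   - ACME: `certbot renew --cert-name api.example.com --force-renewal` (scope it with `--cert-name`; a bare `--force-renewal` re-issues every cert on the host and can hit CA rate limits)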
   - ACM: request a new cert, bind to LB
   - Private CA: issue via CA tool
5. **Deploy** — rotate into place, reload service.

### 7.2 Verification (15–30 min)

- `curl -v https://your-endpoint` returns 200 and valid cert
- `echo | openssl s_client -servername foo -connect host:443 2>/dev/null | openssl x509 -noout -enddate` shows future date
- TLS handshake success rate back above 99.9%
- Real customer traffic recovering (LB / CDN metrics)

### 7.3 After (within 24h)

- Post-mortem.
- Why did monitoring miss the 30-day warning? If it didn't, why did nobody act? (Usually the answer.)
- Add the expired cert's deployment locations to the inventory if missing.
- Set next renewal to 45+ days before new expiry.

---

## 8. CA hygiene

### 8.1 Trusted root store

- Audit trusted roots on every managed device. Remove unused / obsolete roots.
- Pin internal trust to your private CA only; use OS trust store for public CAs.
- Don't ship internal CAs inside public container images.

### 8.2 Root key ceremony (for your own CA)

- Offline machine, no network.
- HSM-backed where possible.
- Witnesses present and signing.
- Dual control for key use.
- Documented recovery procedure.
- Periodic rehearsal (annual).

### 8.3 Intermediate rotation

- Issue new intermediate before old one is due.
- Publish cross-signed chains where possible.
- Monitor for ~30 days before retiring old intermediate.
- Don't rotate intermediate the same week as a root (unless they're bundled).

---

## 9. Metrics to report monthly

| Metric | Target |
|---|---|
| Certs auto-renewed / total | > 95% |
| Certs with owner in inventory | 100% |
| Certs renewed more than 30 days before expiry | > 99% |
| Expired-in-production incidents | 0 |
| Mean time to replace a compromised cert | < 2 hours |
| Unique CAs used | trending down (simpler = safer) |
| Certs or endpoints below modern crypto bar (RSA < 2048, TLS < 1.2) | 0 |
| Certs logged in CT | 100% for public, as policy for private |

---

## 10. Integration with other runbooks

- **Incident response:** a compromised cert is a SEV2 minimum. Tie into the incident runbook.
- **Change management:** cert rotations are changes. Treat emergency renewals as emergency changes with post-hoc CAB review.
- **Vendor management:** third-party services that use your cert (payment gateways, CDN) need notification lead time.
- **DR:** the DR site needs its own valid certs. Test during DR exercises (see `/dr-test-planner`).

---

## 11. Companion files in this library

- `/cert-expiry-audit` skill — paste a cert list, get a prioritised renewal backlog.
- `spreadsheets/certificate-expiry-tracker.csv` — ready-to-use inventory template (if available in your copy).

---

## Attribution

Built by **Hak** at **VantagePoint Networks**. Based on TLS ecosystem best practice + real cert incidents across SaaS, enterprise and regulated financial services. MIT licensed — fork, customise, ship.