Incidents
Real failures, each walked through from symptom to fix. These are the most useful pages on this site, both for me (so I do not redo the work) and for anyone evaluating how I think about systems.
Format for every incident:
- Symptom as I first noticed it.
- First hypothesis and what made me revise it.
- Real root cause.
- Blast radius, including downstream effects that looked like separate issues.
- Fix, including what I did not change and why.
- Follow-ups that survived the incident.
Index, newest first
- 2026-05-03 Grafana down + MetalLB withdrawing IPs (one issue, not two)
- 2026-05-02 1Password rate-limit recurrence, the dynamic-inventory bypass
- 2026-04-19 Bitnami public images quietly disappeared
- 2026-04-18 1Password daily rate-limit, the per-account bucket
- 2026-04-18 etcd bloat, control-plane instability
- 2026-04-13 PVE self-fence, 9-hour alerting blackout, full remediation
- 2026-04-07 Grafana SQLite to Postgres migration, bigint vs boolean