Incidents

Real failures, walked from symptom to fix. These are the most useful pages on this site, both for me (so I do not redo the work) and for anyone evaluating how I think about systems.

Format for every incident:

  1. Symptom as I first noticed it.
  2. First hypothesis and what made me revise it.
  3. Real root cause.
  4. Blast radius, including downstream effects that looked like separate issues.
  5. Fix, including what I did not change and why.
  6. Follow-ups that survived the incident.

Index, newest first