2026-04-13: PVE self-fence, 9-hour alerting blackout, full remediation¶
Date: 2026-04-13, 08:55 PDT Time-to-detect: ~9 hours (no alerts; noticed by user during normal use) Time-to-mitigate: Same day after detection Time-to-remediate (the alerting chain): 2026-04-18 (5 days of follow-up work, 8 PRs) Severity: S1. Lab service degraded for hours, alert pipeline silently down for the entire window.
This one is the foundation incident for a lot of the structural work the lab has now. It exposed three different single points of failure in the pipeline that is supposed to tell me when things are broken, and remediating those took the rest of the week.
Symptom¶
Around 17:30 PDT, while working on something else, I noticed the trading dashboard was unreachable. A kubectl get pod -A showed multiple pods in Pending, all on nodes whose kubelets were not reporting. pvecm status from the surviving Proxmox node showed only one peer.
pve (one of the two Proxmox hosts) had self-fenced at 08:55:04. Seven of eight onboot=1 VMs failed to start back up after the reboot. Several K8s nodes were among them. The cluster ran on whatever was left for nine hours, no alert ever reached me.
Three independent failures that compounded¶
Failure 1: Watchdog routed to null¶
The Alertmanager Watchdog alert is the deadman: it fires constantly and is meant to be received by an external service that pages when it stops arriving. The config had receiver: null for it. So the alert fired, was discarded, and there was no external observer to notice the silence. The whole point of a deadman switch was missing.
Failure 2: Single-replica discord-alert-proxy on a downed node¶
The Discord notification proxy was a 1-replica Deployment. The replica happened to be running on k8cluster1, which was on the downed PVE host. With no replica running, every Alertmanager send failed. There was no podAntiAffinity, no PDB, and no priorityClassName, so when the node came back the controller had no urgency to schedule it.
Failure 3: The proxy installed dependencies at runtime¶
The proxy deployment ran a base python:3.12-slim image and pip install-ed fastapi, uvicorn, httpx from an inline ConfigMap entrypoint. So even when the pod did eventually start, it depended on PyPI being reachable from inside the cluster at startup. Effectively: a monitoring component whose recovery depended on outbound internet from the lab. PyPI being available was something I never checked because nothing in the alert path advertised it as a dependency.
Root cause of the original PVE outage¶
pve-guests.service (the Proxmox systemd unit that starts onboot VMs) fires ~13 seconds after the kernel hands off. The freenas-proxmox plugin I use to resolve iSCSI extent paths calls the TrueNAS REST API at https://10.10.12.2/api/v2.0/system/info. If TrueNAS's API daemon is not yet serving when this call happens, the plugin fails fast with no retry, the systemd unit declares failure for the affected VMs, and they sit in a stopped state.
After a cold reboot of pve, TrueNAS was still booting and not yet serving the API. The plugin failed for every VM that depended on iSCSI extents (the K8s nodes among them). pve-guests gave up. pve was up but the VMs that mattered were not.
Remediation across the cluster (8 PRs, 2026-04-13 to 2026-04-18)¶
Alerting chain¶
- Watchdog → Healthchecks.io (PR #115). The
Watchdogalert is now routed to a Healthchecks.io check via an ExternalSecret-supplied URL. Healthchecks.io emails (and SMS, in this config) when it stops getting pings. External observer of internal silence, finally. discord-alert-proxymade highly available (PRs #117, #118). Two replicas, soft podAntiAffinity, PodDisruptionBudget withminAvailable: 1,priorityClassName: system-cluster-critical. The pod that pages me cannot be the pod most likely to be evicted.- Image baked, dependencies pinned (PRs #119, #120, #121, #122). Source moved into the repo at
infrastructure/monitoring/discord-alert-proxy/. Multi-stage Dockerfile, non-root UID 10001, requirements pinned. Built in CI, pushed toghcr.io/mithr4ndir/discord-alert-proxy, deployment refers to the image by 12-character SHA tag, notlatest. Dependabot covers pip + docker + github-actions ecosystems. Weekly scheduled rebuild keeps the base image fresh; Trivy gates the build on HIGH/CRITICAL CVEs. - PveGuestDown alert improved (PR #117). Tag-based silencing via an anchored
(.*;)?(no-alert|maintenance)(;.*)?regex on the VM tag string, plus aPveGuestMaintenanceTagStalesafety alert at 24 hours so a tag I forgot to remove eventually pages.
Proxmox-side¶
wait-truenas-api.shExecStartPre wired into pve-guests.service via drop-in (roles/pve/freenas_iscsi). Polls the TrueNAS REST endpoint every 2 seconds for up to 300 seconds. Accepts 2xx, 401, and 403 (the API daemon being up but auth-rejecting is "ready"). Rejects 5xx (daemon up but malfunctioning).TimeoutStartSec=infinitypreserves the upstream pve-guests timeout. Ansible deploys the script and the drop-in.- QDevice NSS database rebuilt by hand: corosync-qdevice on pve had had a corrupted
nssdbsince the 04-13 reboot, so the cluster had been running at 2/3 votes for a week and I hadn't noticed. Copied the working nssdb from pve2. - Both PVE nodes upgraded 8.4.x to 8.4.18.
- SSH host-key policy in
ansible.cfgset toaccept-newso brand-new hosts can be reached without manualssh-keyscan, but a host-key change on a known host still refuses (TOFU, not blanket trust). update-metrics.timerruns every 10 minutes sonode_reboot_requiredgauge stays fresh; was stale after reboots before this change.
Validation¶
Real reboots of both pve and pve2 on 2026-04-14. The TrueNAS-readiness wait logged "reachable (HTTP 401) after 9 attempts" in both cases. Every onboot VM started cleanly. ~10 min recovery on pve, ~18 min on pve2. The whole pipeline was tested, including the Healthchecks.io deadman: I let the cluster idle for 11 minutes (longer than the HC.io alert threshold) to confirm it would page if Watchdog stopped, then re-armed.
What this incident is really about¶
It is easy to treat this as "PVE had a bad reboot." The real lesson is structural: don't host the alerter on the thing it monitors. Each of the three failures was a different instance of the same bug.
- The Watchdog routed to
nullmeant the alerter was hosted on its own honesty: I trusted the config to be right, with no external observer. - The single-replica proxy meant the alerter was hosted on the cluster it monitors. Cluster down, alerter down.
- The runtime
pip installmeant the alerter was hosted on PyPI's availability at the worst possible time.
After the remediation, every leg of the alert path has an external observer or external pin:
- Watchdog has an external observer (Healthchecks.io).
discord-alert-proxyhas external pinning (image baked at GHCR, deployed by SHA).discord-alert-proxyhas internal redundancy (HA, PDB, priority).
There is still one node left to chase: the alert pipeline still runs inside the K8s cluster, so a full cluster-down event will still go silent. That is the active question behind "public read-only Grafana" and the off-site Uptime Kuma decision.
Follow-ups¶
- Alert chain hardened (8 PRs above).
- Boot-race fixed (
wait-truenas-api.sh). - QDevice NSS rebuilt; PVE upgrades applied.
- Reboot-required gauge timer refresh.
- Off-site external probe (Uptime Kuma off-site) so a full cluster-down still pages. Tracked under decisions/index.md.
- CI step that walks every
image:reference and HEAD-requests the registry, so a future "supply chain rot" failure does not need to be discovered by accident again. Same idea, different scope.
What this incident is a good example of¶
- A symptom-vs-cause split. "Trading dashboard is down" looked like an app failure; the real failure was the alerting chain that should have told me 9 hours earlier.
- Three independent SPOFs in the same path. Fixing only the loudest one would have left the other two waiting to re-fire.
- Choosing structural fixes (HA, deadman, baked image) over tweak fixes (bump retry count, increase replica memory). The fix list looks heavier upfront but doesn't require the same diagnosis a year later.