# 2026-04-19: Bitnami public images quietly disappeared
Date: 2026-04-19
Severity: S3. Single Job stuck in `ImagePullBackOff` for ~67 minutes (284 retries). No user-visible impact; caught during an unrelated investigation.
## Symptom
While poking at something else in `apps/media/`, I noticed a PostSync hook (`jellyfin-endpoints-hook.yaml`) had been failing image pulls for over an hour with:

```
rpc error: code = NotFound desc = failed to pull and unpack image
"docker.io/bitnami/kubectl:1.31": failed to resolve image:
docker.io/bitnami/kubectl:1.31: not found
```
Tag, registry, exact spelling all looked correct.
## What happened
Bitnami quietly retired their public Docker Hub images sometime in 2025. Every classic `bitnami/<image>:<tag>` path now 404s on resolution. There is a continuation namespace at `bitnamilegacy/<image>:<tag>` that still resolves today, but Bitnami has flagged it as EOL-bound, so it is a kicking-the-can fix rather than a real one.
The hook itself was harmless: a Job that templated some Service Endpoints after Argo synced the Jellyfin app. With the image gone, the Job sat in retry and the post-sync wave never completed, but the running Jellyfin pod kept serving, so nothing was visible to users. This is the worst kind of quiet failure: it would have left a future ArgoCD sync looking stuck or reporting a misleading success.
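For context, the failing hook looked roughly like this. A hypothetical reconstruction: the PostSync annotation, the image reference, and the shell-heredoc pattern come from the incident; every other field is an assumption:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: jellyfin-endpoints-hook        # assumed; matches the manifest filename
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation  # assumed policy
spec:
  backoffLimit: 6                      # assumed; any limit retries like this
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: kubectl
          image: docker.io/bitnami/kubectl:1.31   # the tag that no longer resolves
          command: ["/bin/sh", "-c"]              # needs a shell for the heredoc
          args:
            - |
              kubectl apply -f - <<'EOF'
              # ...templated Service/Endpoints manifests elided...
              EOF
```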
## Blast-radius audit
After fixing the immediate hook, I grepped the relevant repos for every Bitnami reference. The same retirement affects:
- Kubernetes Jobs, CronJobs, and initContainers across `apps/*` and `infrastructure/*`.
- Ansible roles and `docker-compose` files in `playbooks/*` and `roles/*` that pull `bitnami/*`.
- Helm chart `values.yaml` files that default `image.repository` to `bitnami/*` (this is the sneaky one; many community charts default to it, and you only know if you read the chart).
- Dockerfile `FROM bitnami/*` base images.
```bash
# k8s-argocd
grep -rn "image: bitnami" --include="*.yaml" .
grep -rn "bitnami/" --include="*.yaml" --include="Chart.*" .

# ansible-quasarlab
grep -rn "bitnami/" --include="*.yml" --include="*.yaml" --include="Dockerfile*" .
```
Only the one Job was affected in this case. The audit is now part of *Runbook: Bitnami image 404*.
## Replacement policy

| Need | Pick |
|---|---|
| `kubectl` with a shell (heredocs in Jobs) | `alpine/kubectl:<cluster-version>` |
| `kubectl` distroless (exec only, no shell) | `cgr.dev/chainguard/kubectl:latest` |
| Postgres, Redis, MongoDB, RabbitMQ, etc. | Docker Hub official, Chainguard, or a vendor-maintained image |
| Helm chart that defaults to `bitnami/*` | Override `image.repository` in `values.yaml` |
Pin to a specific patch version, never `latest`, never just the minor. Trivy + Dependabot will then flag tag drift as a finding instead of letting it slide silently.
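For the Helm case, the override is a few lines of values. A minimal sketch, assuming the common `image.registry` / `image.repository` / `image.tag` convention; key names vary by chart, so check the chart's own `values.yaml` first:

```yaml
# Hypothetical values override; key names follow the common convention
# but are chart-specific.
image:
  registry: docker.io
  repository: alpine/kubectl
  tag: "1.33.4"   # pin the patch, not just the minor
```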
## Fix

PR #143: `bitnami/kubectl:1.31` → `alpine/kubectl:1.33.4` in the Jellyfin endpoints hook. Matches the cluster minor exactly, has `/bin/sh` for the heredoc the Job uses, and is maintained by an active project.
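The change itself is one field in the hook's pod template; roughly (surrounding fields elided, container name assumed):

```yaml
containers:
  - name: kubectl                   # container name assumed
    image: alpine/kubectl:1.33.4    # was docker.io/bitnami/kubectl:1.31
```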
## Why this caught me
This was a silent-rot failure: the resource never changed, the registry path was unchanged, but the world changed underneath. The Trivy + Dependabot setup I built for `discord-alert-proxy` (during the 2026-04-13 alerting cascade remediation) catches this class of problem automatically for images I build. Standalone `image:` references in third-party manifests do not get scanned unless I wire them in explicitly.
## Follow-ups

- Replace the affected Bitnami pull in the Jellyfin hook (PR #143).
- Runbook for the next time this happens with a different image.
- CI step that walks every `image:` reference across `k8s-argocd` and HEAD-requests its registry (see the sketch after this list). Cheap, deterministic, fails loud on 404. Tracked.
- Periodic audit of base images in the `Dockerfile`s of my own repos (`discord-alert-proxy`, `claude-bridge`, `sky-explorer`). Dependabot catches CVEs; it does not catch "vendor pulled the rug."
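A minimal sketch of that CI step, assuming Docker Hub images only, naive ref parsing (no digests, no registry ports), and `curl` + `jq` on the runner:

```bash
#!/usr/bin/env bash
# Sketch: fail CI when any image: reference in the repo no longer resolves.
set -euo pipefail

fail=0
# Collect unique "image: <ref>" values from all YAML in the repo.
refs=$(grep -rhoE 'image: *"?[A-Za-z0-9._/:-]+' --include='*.yaml' . \
        | sed -E 's/^image: *"?//' | sort -u || true)

for ref in $refs; do
  [[ "$ref" == *:* ]] || continue            # untagged refs: out of scope here
  repo=${ref%:*}; tag=${ref##*:}             # naive split on the last colon
  repo=${repo#docker.io/}
  [[ "$repo" == */* ]] || repo="library/${repo}"   # official images live under library/
  [[ "${repo%%/*}" == *.* ]] && continue     # other registries: skipped in this sketch

  # Docker Hub needs an (anonymous, free) bearer token even for public pulls.
  token=$(curl -fsS "https://auth.docker.io/token?service=registry.docker.io&scope=repository:${repo}:pull" | jq -r .token)

  # HEAD the manifest: 200 = tag resolves; 404 = gone, the Bitnami failure mode.
  code=$(curl -s -o /dev/null -w '%{http_code}' -I \
    -H "Authorization: Bearer ${token}" \
    -H 'Accept: application/vnd.oci.image.index.v1+json, application/vnd.docker.distribution.manifest.list.v2+json, application/vnd.docker.distribution.manifest.v2+json' \
    "https://registry-1.docker.io/v2/${repo}/manifests/${tag}")

  [[ "$code" == "200" ]] || { echo "MISSING (${code}): ${ref}"; fail=1; }
done
exit "$fail"
```

It is deliberately dumb: no caching, no parallelism, exit nonzero on any unresolvable tag, so a vendor pulling a namespace shows up as a red presubmit instead of a stuck sync wave.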
## What this incident is a good example of

- "Free public base image from a big vendor" is not a permanent dependency. Treat them like transitive packages and audit them.
- Silent failures are worse than loud ones. A broken Job in retry is invisible until you happen to look. Add alerts on `kube_job_failed` for production-relevant Jobs, not just user-facing workloads (see the sketch after this list).
- Pin patch, not minor. Tag drift via `:latest` resolution can land a CVE or, as here, a complete vendor change. Patch-pinning makes the scanner do its job.
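A sketch of that alert, assuming kube-state-metrics (which exposes `kube_job_failed`) and the Prometheus Operator; the rule name, duration, and severity label are placeholders:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: job-failure-alerts   # hypothetical name
spec:
  groups:
    - name: jobs
      rules:
        - alert: KubeJobFailed
          # kube-state-metrics sets condition="true" once the Job has failed.
          expr: kube_job_failed{condition="true"} > 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: Job {{ $labels.namespace }}/{{ $labels.job_name }} has failed
```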