Runbook: Grafana CrashLoopBackOff¶
Symptom¶
kube-prometheus-stack-grafana-* pod shows 2/3 Running, CrashLoopBackOff. The two healthy containers are the grafana-sc-dashboard and grafana-sc-datasources k8s-sidecar containers. The crashing one is grafana itself.
Triage¶
POD=$(kubectl -n monitoring get pod -l app.kubernetes.io/name=grafana -o jsonpath='{.items[0].metadata.name}')
# Why is the grafana container exiting?
kubectl -n monitoring logs $POD -c grafana --previous --tail=200
# Pod-level events:
kubectl -n monitoring describe pod $POD | tail -40
Read the previous container logs, because the live container is in backoff and has no logs of its own.
Common patterns:
| Log signature | Likely cause | See |
|---|---|---|
pq: no pg_hba.conf entry for host ... |
Postgres rejecting the new source IP | pg_hba runbook |
dial tcp <postgres-ip>:5432: i/o timeout |
Network reachability, firewall, or Postgres down | check nc -vz <ip> 5432, ssh and systemctl status postgresql@16-main |
Error parsing config file ... grafana.ini |
ConfigMap rolled out by Reloader has a typo | kubectl -n monitoring describe cm kube-prometheus-stack-grafana and revert in Git |
failed to load plugin |
GF_INSTALL_PLUGINS references a plugin that is no longer compatible with the Grafana image version |
pin the plugin or upgrade Grafana |
Fix¶
Once the underlying cause is fixed, kick the pod to skip the backoff window:
kubectl delete pod -n monitoring \
-l app.kubernetes.io/name=grafana
kubectl wait pod -n monitoring \
-l app.kubernetes.io/name=grafana --for=condition=Ready --timeout=120s
Note on the ConfigMap being ArgoCD-managed¶
kube-prometheus-stack-grafana is reconciled by ArgoCD. Editing the ConfigMap in-cluster will be reverted on next sync. If the fix is a config change, do it in the Helm values in Git and let ArgoCD apply it. Direct kubectl edit is fine only as a temporary mitigation, and you should pause the ArgoCD app before doing so:
argocd app set kube-prometheus-stack --sync-policy none
# ... edit ...
# put the same change in Git, then:
argocd app set kube-prometheus-stack --sync-policy automated
argocd app sync kube-prometheus-stack
Reloader interaction¶
The pod has reloader.stakater.com/auto: "true". Any change to the ConfigMap or the linked Secret (grafana-secrets, kube-prometheus-stack-grafana) will trigger a rollout. If Grafana started crashing right after a quiet ConfigMap or Secret change, that is your trigger window.