Skip to content

Runbook: Grafana CrashLoopBackOff

Symptom

kube-prometheus-stack-grafana-* pod shows 2/3 Running, CrashLoopBackOff. The two healthy containers are the grafana-sc-dashboard and grafana-sc-datasources k8s-sidecar containers. The crashing one is grafana itself.

Triage

POD=$(kubectl -n monitoring get pod -l app.kubernetes.io/name=grafana -o jsonpath='{.items[0].metadata.name}')

# Why is the grafana container exiting?
kubectl -n monitoring logs $POD -c grafana --previous --tail=200

# Pod-level events:
kubectl -n monitoring describe pod $POD | tail -40

Read the previous container logs, because the live container is in backoff and has no logs of its own.

Common patterns:

Log signature Likely cause See
pq: no pg_hba.conf entry for host ... Postgres rejecting the new source IP pg_hba runbook
dial tcp <postgres-ip>:5432: i/o timeout Network reachability, firewall, or Postgres down check nc -vz <ip> 5432, ssh and systemctl status postgresql@16-main
Error parsing config file ... grafana.ini ConfigMap rolled out by Reloader has a typo kubectl -n monitoring describe cm kube-prometheus-stack-grafana and revert in Git
failed to load plugin GF_INSTALL_PLUGINS references a plugin that is no longer compatible with the Grafana image version pin the plugin or upgrade Grafana

Fix

Once the underlying cause is fixed, kick the pod to skip the backoff window:

kubectl delete pod -n monitoring \
  -l app.kubernetes.io/name=grafana
kubectl wait pod -n monitoring \
  -l app.kubernetes.io/name=grafana --for=condition=Ready --timeout=120s

Note on the ConfigMap being ArgoCD-managed

kube-prometheus-stack-grafana is reconciled by ArgoCD. Editing the ConfigMap in-cluster will be reverted on next sync. If the fix is a config change, do it in the Helm values in Git and let ArgoCD apply it. Direct kubectl edit is fine only as a temporary mitigation, and you should pause the ArgoCD app before doing so:

argocd app set kube-prometheus-stack --sync-policy none
# ... edit ...
# put the same change in Git, then:
argocd app set kube-prometheus-stack --sync-policy automated
argocd app sync kube-prometheus-stack

Reloader interaction

The pod has reloader.stakater.com/auto: "true". Any change to the ConfigMap or the linked Secret (grafana-secrets, kube-prometheus-stack-grafana) will trigger a rollout. If Grafana started crashing right after a quiet ConfigMap or Secret change, that is your trigger window.