Runbook: Postgres rejecting K8s pods
Symptom
App pods fail to connect to the external Postgres VM (192.168.1.123). Pod logs show:
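An illustrative example of the rejection (user and database names here are placeholders; the exact wording varies slightly across Postgres versions):
FATAL:  no pg_hba.conf entry for host "192.168.1.90", user "grafana", database "grafana", no encryption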
The 192.168.1.X in the error is a K8s node IP, not a pod IP. Multiple apps that share the same Postgres backend lose readiness at roughly the same time.
Triage
Confirm it is the auth layer, not Postgres being down:
# Is Postgres reachable?
nc -vz 192.168.1.123 5432
# Same TCP probe with no extra tools (exit 0 = open, 1 = refused, 124 = timed out):
timeout 3 bash -c 'exec 3<>/dev/tcp/192.168.1.123/5432'; echo "exit=$?"
# Postgres logs from the VM:
ssh 192.168.1.123 'sudo journalctl -u postgresql@16-main --since "30 min ago" | tail -50'
Look for `no pg_hba.conf entry` lines. The source IP in those lines is the address the connection arrives from as seen by Postgres, i.e. after Calico SNAT.
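To confirm the logged source address is a node and not a pod, compare it against the node inventory (this assumes the pod CIDR 10.244.0.0/16 mentioned at the end of this runbook):
# Node INTERNAL-IPs: the rejected source should match one of these
kubectl get nodes -o wide
# Pod IPs: with SNAT in effect, none of these should appear in the Postgres log
kubectl get pods -A -o wide | grep 10.244.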
Fix
Allow the K8s node IPs explicitly. Back up pg_hba.conf before editing.
ssh 192.168.1.123 "
TS=\$(date +%Y%m%d-%H%M%S)
sudo cp -a /etc/postgresql/16/main/pg_hba.conf \
/etc/postgresql/16/main/pg_hba.conf.bak-\${TS}
sudo tee -a /etc/postgresql/16/main/pg_hba.conf > /dev/null <<EOF
# K8s cluster nodes
host all all 192.168.1.89/32 scram-sha-256
host all all 192.168.1.90/32 scram-sha-256
host all all 192.168.1.91/32 scram-sha-256
EOF
sudo systemctl reload postgresql@16-main
"
Bounce the affected pods so they reconnect immediately instead of waiting out the CrashLoopBackOff delay:
kubectl delete pod -n monitoring -l app.kubernetes.io/name=grafana
kubectl delete pod -n automation -l app=claude-bridge
Verify:
kubectl wait pod -n monitoring -l app.kubernetes.io/name=grafana --for=condition=Ready --timeout=120s
kubectl wait pod -n automation -l app=claude-bridge --for=condition=Ready --timeout=120s
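For an end-to-end check that does not depend on the apps' own retry logic, a throwaway client pod can connect directly. The image, user, and database below are placeholders; psql prompts for the password:
kubectl run pg-check --rm -it --restart=Never --image=postgres:16 -- \
  psql "host=192.168.1.123 port=5432 user=grafana dbname=grafana" -c 'SELECT 1'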
Follow-up
- Persist in Ansible (`ansible-quasarlab/labctl-runs/postgres/`) so a config run does not revert it.
- Move toward `hostssl` + `sslmode=require` in app configs so the LAN hop is encrypted, then change the rules above from `host` to `hostssl` (sketched below).
- Track what introduced the SNAT. If the old behavior had pod IPs reaching Postgres directly, a Calico upgrade or `natOutgoing` change is the most likely cause. Check `dpkg.log` and Calico release notes for the upgrade window.
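A sketch of the TLS-only variant, assuming the server already presents a usable certificate and the apps use libpq-style DSNs (user, database, and password are placeholders):
# pg_hba.conf: only accept TLS connections from the cluster nodes
hostssl all all 192.168.1.89/32 scram-sha-256
hostssl all all 192.168.1.90/32 scram-sha-256
hostssl all all 192.168.1.91/32 scram-sha-256
# matching app-side DSN
postgresql://grafana:CHANGE_ME@192.168.1.123:5432/grafana?sslmode=require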
Why this happens at all
Calico masquerades pod traffic that egresses the cluster, so external hosts see the node IP as source. Allowing only the pod CIDR (10.244.0.0/16) in pg_hba.conf will not work, because by the time the packet reaches Postgres the source has already been rewritten.
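To check whether the pool is masquerading, the Calico IPPool can be inspected directly; the pool name default-ipv4-ippool is the default and is an assumption here (calicoctl get ippool -o wide shows the same NAT field):
# natOutgoing: true means traffic leaving the cluster is SNATed to the node IP
kubectl get ippools.crd.projectcalico.org default-ipv4-ippool -o yaml | grep -E 'cidr|natOutgoing'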