Runbook: MetalLB IP not answering ARP¶

Symptom¶

A LoadBalancer IP in the 192.168.1.225-240 pool stops responding from the LAN. ping times out, curl cannot connect, but the underlying app has at least one Running pod and the Service still shows the EXTERNAL-IP.

Triage¶

Three things to check, in order:

# 1. Is there a Ready endpoint? "No ready endpoints" is the most common cause.
kubectl get endpointslices -A | grep <service-name>
kubectl get pod -n <ns> -l <selector> -o wide

# 2. What do the speakers say? Look for "notOwner" or "serviceWithdrawn".
for p in $(kubectl -n metallb-system get pod -l component=speaker -o name); do
  echo "=== $p ==="
  kubectl -n metallb-system logs --tail=400 $p \
    | grep -E "<service-name>|<lb-ip>|notOwner|Withdrawn|Announced"
done

# 3. ARP table from a host on the same L2 segment:
ip neigh | grep 192.168.1.<lb-ip>
arping -c 3 192.168.1.<lb-ip>   # if installed

Decision tree:

Finding	Cause	Fix
`serviceWithdrawn ... reason=notOwner` and no `serviceAnnounced` from any speaker	No node has a Ready endpoint	Fix readiness on the underlying pod
`serviceAnnounced` from a speaker but no ARP reply	Network policy or firewall on the node, or the elected speaker pod is unhealthy	Check speaker pod, host firewall, Calico policies
EXTERNAL-IP missing	Pool exhausted or controller down	`kubectl get pod -n metallb-system` and `kubectl describe svc`

Fix (the "no ready endpoints" case)¶

MetalLB itself does not need to be touched. Fix readiness on the underlying app and MetalLB will re-elect a speaker and re-announce the IP automatically.

kubectl describe pod -n <ns> <pod>          # readiness probe details, recent events
kubectl logs -n <ns> <pod> --previous       # if it has crashed

If the readiness regression turns out to be a backend-auth issue, see the pg_hba runbook.

Why this happens¶

MetalLB in L2 mode will only advertise an IP from a node that has at least one Ready endpoint cluster-wide. When the last endpoint flips to NotReady, the elected speaker logs serviceWithdrawn ... reason=notOwner and stops answering ARP for that IP. This is correct behavior: there is nothing healthy to send traffic to. Once any endpoint becomes Ready again, a speaker takes ownership and ARP starts working.

Quick check that MetalLB itself is fine¶

kubectl get ipaddresspool,l2advertisement -n metallb-system
kubectl get pod -n metallb-system -o wide      # 1 controller + N speakers, all Running
kubectl get svc -A --field-selector spec.type=LoadBalancer

If all three look right and the right speaker is logging serviceAnnounced for your IP, the problem is downstream of MetalLB.