Skip to content

Runbook: MetalLB IP not answering ARP

Symptom

A LoadBalancer IP in the 192.168.1.225-240 pool stops responding from the LAN. ping times out, curl cannot connect, but the underlying app has at least one Running pod and the Service still shows the EXTERNAL-IP.

Triage

Three things to check, in order:

# 1. Is there a Ready endpoint? "No ready endpoints" is the most common cause.
kubectl get endpointslices -A | grep <service-name>
kubectl get pod -n <ns> -l <selector> -o wide

# 2. What do the speakers say? Look for "notOwner" or "serviceWithdrawn".
for p in $(kubectl -n metallb-system get pod -l component=speaker -o name); do
  echo "=== $p ==="
  kubectl -n metallb-system logs --tail=400 $p \
    | grep -E "<service-name>|<lb-ip>|notOwner|Withdrawn|Announced"
done

# 3. ARP table from a host on the same L2 segment:
ip neigh | grep 192.168.1.<lb-ip>
arping -c 3 192.168.1.<lb-ip>   # if installed

Decision tree:

Finding Cause Fix
serviceWithdrawn ... reason=notOwner and no serviceAnnounced from any speaker No node has a Ready endpoint Fix readiness on the underlying pod
serviceAnnounced from a speaker but no ARP reply Network policy or firewall on the node, or the elected speaker pod is unhealthy Check speaker pod, host firewall, Calico policies
EXTERNAL-IP missing Pool exhausted or controller down kubectl get pod -n metallb-system and kubectl describe svc

Fix (the "no ready endpoints" case)

MetalLB itself does not need to be touched. Fix readiness on the underlying app and MetalLB will re-elect a speaker and re-announce the IP automatically.

kubectl describe pod -n <ns> <pod>          # readiness probe details, recent events
kubectl logs -n <ns> <pod> --previous       # if it has crashed

If the readiness regression turns out to be a backend-auth issue, see the pg_hba runbook.

Why this happens

MetalLB in L2 mode will only advertise an IP from a node that has at least one Ready endpoint cluster-wide. When the last endpoint flips to NotReady, the elected speaker logs serviceWithdrawn ... reason=notOwner and stops answering ARP for that IP. This is correct behavior: there is nothing healthy to send traffic to. Once any endpoint becomes Ready again, a speaker takes ownership and ARP starts working.

Quick check that MetalLB itself is fine

kubectl get ipaddresspool,l2advertisement -n metallb-system
kubectl get pod -n metallb-system -o wide      # 1 controller + N speakers, all Running
kubectl get svc -A --field-selector spec.type=LoadBalancer

If all three look right and the right speaker is logging serviceAnnounced for your IP, the problem is downstream of MetalLB.