kubectl¶
Find broken things first¶
# Pods that are not Running:
kubectl get pod -A --field-selector=status.phase!=Running
# Pods where not all containers are Ready (ready != desired):
kubectl get pod -A -o json \
| jq -r '.items[] | select(.status.containerStatuses != null)
| select([.status.containerStatuses[].ready] | all | not)
| "\(.metadata.namespace)/\(.metadata.name) \(.status.phase)"'
# Recent warnings cluster-wide, newest last:
kubectl get events -A --sort-by=.lastTimestamp \
| grep -v Normal | tail -40
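# Same events query, scoped to one object once you know which pod is misbehaving (a small variant of the above):
kubectl get events -n <ns> --field-selector involvedObject.name=<pod> --sort-by=.lastTimestamp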
# All LoadBalancer Services and their assigned IPs:
kubectl get svc -A --field-selector spec.type=LoadBalancer \
-o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,EXTIP:.status.loadBalancer.ingress[0].ip
Logs (the most-used commands in any incident)¶
# Live tail
kubectl logs -n <ns> <pod> -c <container> -f
# Previous container, after a crash. This is where the actual error usually is.
kubectl logs -n <ns> <pod> -c <container> --previous --tail=200
# All containers in a pod, prefixed:
kubectl logs -n <ns> <pod> --all-containers --prefix --tail=200
# Filter the chatty health-probe noise:
kubectl logs -n <ns> <pod> | grep -vE "GET /(healthz|readyz|metrics)"
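Two more flags worth having at hand when lining logs up against dashboards (both standard kubectl; shown as a sketch):
# Only the recent window, with timestamps:
kubectl logs -n <ns> <pod> --since=15m --timestamps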
Watch readiness without polling by hand¶
kubectl wait is purpose-built for this. Use it inside scripts.
kubectl wait pod -n <ns> -l <selector> --for=condition=Ready --timeout=120s
# Or if you want the pod-level Phase:
kubectl wait pod -n <ns> -l <selector> --for=jsonpath='{.status.phase}'=Running --timeout=120s
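It also works in the other direction, e.g. blocking until a pod you just deleted is actually gone (a sketch with the same placeholders):
kubectl wait pod -n <ns> <pod> --for=delete --timeout=60s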
For "until both of these are Ready," loop on jsonpath:
until \
[ "$(kubectl get pod -n monitoring -l app.kubernetes.io/name=grafana \
-o jsonpath='{.items[0].status.containerStatuses[?(@.name=="grafana")].ready}')" = "true" ] \
&& \
[ "$(kubectl get pod -n automation -l app=claude-bridge \
-o jsonpath='{.items[0].status.containerStatuses[0].ready}')" = "true" ]; do
sleep 4
done
echo READY
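One caveat with the hand-rolled loop: unlike kubectl wait, it never gives up. A sketch of the same idea with an overall deadline via coreutils timeout (one pod shown for brevity; the 180s is arbitrary):
timeout 180 bash -c \
'until kubectl get pod -n automation -l app=claude-bridge -o jsonpath="{.items[0].status.containerStatuses[0].ready}" | grep -qx true; do sleep 4; done' \
&& echo READY || echo "gave up after 180s"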
Inspect a resource quickly¶
# YAML for any resource (works for CRDs):
kubectl get <kind> -n <ns> <name> -o yaml
# Just the status:
kubectl get pod -n <ns> <name> -o jsonpath='{.status.conditions}' | jq .
# Just the container ready states:
kubectl get pod -n <ns> <name> -o jsonpath='{range .status.containerStatuses[*]}{.name}{": ready="}{.ready}{" restarts="}{.restartCount}{"\n"}{end}'
# Custom columns for a list view:
kubectl get pod -A -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,NODE:.spec.nodeName,IP:.status.podIP
Restart things¶
# Roll a Deployment / DaemonSet / StatefulSet without changing the spec:
kubectl rollout restart deployment/<name> -n <ns>
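# Watch that restart actually finish (pairs with the command above; the timeout is arbitrary):
kubectl rollout status deployment/<name> -n <ns> --timeout=180s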
# Skip the CrashLoopBackOff delay for a single pod:
kubectl delete pod -n <ns> <pod> # the controller recreates it immediately
# Drain a node before maintenance:
kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --grace-period=60
# ... do work ...
kubectl uncordon <node>
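A quick post-maintenance check that the node came back, reusing the kubectl wait pattern from earlier (sketch):
kubectl wait node <node> --for=condition=Ready --timeout=300s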
CRDs and "I forgot the API group"¶
# What kinds exist?
kubectl api-resources | grep -i <substring>
# Full schema for a kind, recursively:
kubectl explain ipaddresspool.spec --api-version=metallb.io/v1beta1 --recursive
kubeconfig and contexts¶
kubectl config get-contexts
kubectl config use-context <name>
kubectl config view --minify --raw # current context only, with secrets
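Related, and easy to forget: pin a default namespace on the current context so you can drop -n during a long session (standard kubectl; sketch):
kubectl config set-context --current --namespace=<ns>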
etcd operations on a kubeadm cluster¶
Run from a control-plane node. The cert paths assume the kubeadm default; adjust if your setup is different.
ETCD_ENDPOINTS=https://192.168.1.89:2379,https://192.168.1.90:2379,https://192.168.1.91:2379
ETCD_CERTS="--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key"
# What's the on-disk size of each member's DB?
sudo ETCDCTL_API=3 etcdctl --endpoints=$ETCD_ENDPOINTS $ETCD_CERTS \
endpoint status -w table
# Compaction: discards historical revisions. Logical, cluster-wide, fast.
REV=$(sudo ETCDCTL_API=3 etcdctl --endpoints=$ETCD_ENDPOINTS $ETCD_CERTS \
endpoint status -w json | jq '.[0].Status.header.revision')
sudo ETCDCTL_API=3 etcdctl --endpoints=$ETCD_ENDPOINTS $ETCD_CERTS compact $REV
# Defragmentation: rewrites the .db file to reclaim disk. Per-member.
# Run one at a time, wait for endpoint health between members.
for ep in https://192.168.1.89:2379 https://192.168.1.90:2379 https://192.168.1.91:2379; do
sudo ETCDCTL_API=3 etcdctl --endpoints=$ep $ETCD_CERTS defrag
sudo ETCDCTL_API=3 etcdctl --endpoints=$ep $ETCD_CERTS endpoint health
done
# Alarms (raised on NOSPACE, etc.)
sudo ETCDCTL_API=3 etcdctl --endpoints=$ETCD_ENDPOINTS $ETCD_CERTS alarm list
sudo ETCDCTL_API=3 etcdctl --endpoints=$ETCD_ENDPOINTS $ETCD_CERTS alarm disarm
Compaction and defragmentation are different operations with confusingly similar names. Compaction reduces logical size; defragmentation reclaims disk. You need both. See the etcd-defrag runbook.
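To put numbers on that difference: endpoint status reports both sizes, and the gap between them is roughly what a defrag would reclaim (field names per my reading of etcdctl's JSON output, so treat as a sketch):
sudo ETCDCTL_API=3 etcdctl --endpoints=$ETCD_ENDPOINTS $ETCD_CERTS \
endpoint status -w json \
| jq -r '.[] | "\(.Endpoint): dbSize=\(.Status.dbSize) dbSizeInUse=\(.Status.dbSizeInUse)"'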
ExternalSecrets / secrets management¶
# Force a single ExternalSecret to re-sync immediately:
kubectl annotate externalsecret <name> -n <ns> \
force-sync=$(date +%s) --overwrite
# Show last sync status across all ExternalSecrets:
kubectl get externalsecret -A \
-o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,STATUS:.status.conditions[0].reason,SYNCED:.status.conditions[0].status,LASTSYNC:.status.refreshTime
# Scale ESO down (stops the retry loop during a rate-limit incident):
kubectl -n external-secrets scale deploy external-secrets --replicas=0
kubectl -n external-secrets scale deploy external-secrets-cert-controller --replicas=0
kubectl -n external-secrets scale deploy external-secrets-webhook --replicas=0
# Bring it back:
kubectl -n external-secrets scale deploy external-secrets --replicas=1
kubectl -n external-secrets scale deploy external-secrets-cert-controller --replicas=1
kubectl -n external-secrets scale deploy external-secrets-webhook --replicas=1
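# Confirm the controller is reconciling again rather than hitting the same rate limit
# (assumes the Deployment name used above):
kubectl -n external-secrets logs deploy/external-secrets --tail=100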
# What is this Secret actually storing? (decoded values)
kubectl get secret -n <ns> <name> -o json \
| jq -r '.data | to_entries[] | "\(.key): \(.value | @base64d)"'
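When you only need one key, jsonpath plus base64 is less to parse (a sketch; <key> is a placeholder and would need dots escaped in jsonpath):
kubectl get secret -n <ns> <name> -o jsonpath='{.data.<key>}' | base64 -d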
Find pods by restart count¶
A leading indicator of a crashloop or a flaky dependency. Often the first signal of underlying control-plane instability (etcd bloat, scheduler problems).
# Top 20 pods cluster-wide by restart count of the first container:
kubectl get pod -A --sort-by=.status.containerStatuses[0].restartCount \
-o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount,STATUS:.status.phase \
| tail -20
# Anything that has restarted more than 5 times in any container:
kubectl get pod -A -o json \
| jq -r '.items[] | select(.status.containerStatuses != null)
| select([.status.containerStatuses[].restartCount] | max > 5)
| "\(.metadata.namespace)/\(.metadata.name) restarts=\([.status.containerStatuses[].restartCount] | max)"'
Things I reach for during an incident¶
Grouped by question.
| Question | Command |
|---|---|
| Why is this pod not Ready? | kubectl describe pod -n <ns> <pod> \| tail -40 |
| What did the previous container say before crashing? | kubectl logs -n <ns> <pod> -c <c> --previous --tail=200 |
| Has anything restarted lately? | kubectl get pod -A --sort-by=.status.containerStatuses[0].restartCount \| tail |
| Does this MetalLB IP have a Ready endpoint? | kubectl get endpointslices -A \| grep <svc> |
| What does the speaker think about this IP? | kubectl logs -n metallb-system -l component=speaker --tail=400 \| grep -E "<svc>\|<ip>\|notOwner" |
| Did something just rollout? | kubectl get events -A --sort-by=.lastTimestamp \| tail -40 |
| Is this resource ArgoCD-managed? | kubectl get <kind> <name> -n <ns> -o jsonpath='{.metadata.annotations.argocd\.argoproj\.io/tracking-id}' |
Patterns worth knowing¶
- Always read --previous logs after a crash. The live container is in backoff and has nothing to say.
- Pod conditions over phase. phase=Running does not mean the pod is healthy; check Ready and ContainersReady.
- kubectl wait is better than sleep. Avoids race conditions and times out cleanly.
- -o jsonpath is faster than piping -o yaml into grep. It also does not break when YAML formatting changes (see the pair of commands after this list).
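The last point, shown as a pair of commands (illustrative only; the grep version depends on indentation and matches any ready: anywhere in the object):
# Brittle: scrapes rendered YAML
kubectl get pod -n <ns> <pod> -o yaml | grep 'ready:'
# Precise: asks for exactly the fields you want
kubectl get pod -n <ns> <pod> -o jsonpath='{.status.containerStatuses[*].ready}'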