Prometheus and amtool¶
promtool: validate before pushing¶
promtool check config /etc/prometheus/prometheus.yml
promtool check rules /etc/prometheus/rules/*.yml
promtool query instant http://192.168.1.230:9090 'up{job="kubelet"} == 0'
promtool query range http://192.168.1.230:9090 \
--start=$(date -d '1 hour ago' -Iseconds) --end=$(date -Iseconds) --step=60s \
'rate(node_cpu_seconds_total{mode="idle"}[1m])'
# amtool also has a check-config:
amtool check-config /etc/alertmanager/alertmanager.yml
The CI runs promtool check rules on every PR that touches monitoring/. Catches typos and bad PromQL before they reach the cluster.
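Roughly what that check boils down to (a sketch, not the actual pipeline; it assumes rule files live under monitoring/rules/ and borrows promtool from the prom/prometheus image so CI needs no local install):
# Run promtool out of the Prometheus image; the inner shell expands the glob.
docker run --rm -v "$PWD:/work" --entrypoint /bin/sh prom/prometheus:latest \
  -c 'promtool check rules /work/monitoring/rules/*.yml'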
PromQL idioms I keep needing¶
# Targets that are down:
up == 0
# Pods that have restarted in the last 15 minutes:
increase(kube_pod_container_status_restarts_total[15m]) > 0
# Top 10 namespaces by CPU usage:
topk(10, sum by (namespace) (rate(container_cpu_usage_seconds_total{namespace!=""}[5m])))
# Alert on saturation: disk filling fast (will be full in <4h at current rate):
(predict_linear(node_filesystem_avail_bytes[1h], 4*3600) < 0)
and node_filesystem_avail_bytes > 0
# Memory pressure approaching limit, per pod:
(container_memory_working_set_bytes{container!=""} /
on(namespace, pod, container) kube_pod_container_resource_limits{resource="memory"})
> 0.9
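Most of these end up in rule files rather than ad-hoc queries. A sketch of the disk-filling expression as an alert (group name, alert name, and the 30m hold are illustrative):
groups:
  - name: node-disk
    rules:
      - alert: FilesystemFillingUp
        expr: |
          (predict_linear(node_filesystem_avail_bytes[1h], 4*3600) < 0)
          and node_filesystem_avail_bytes > 0
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} {{ $labels.mountpoint }} will be full in under 4h at the current rate"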
amtool: silence routine work¶
amtool --alertmanager.url=http://192.168.1.233:9093 alert query
# Silence a single alert for 2 hours:
amtool silence add alertname=KubeNodeNotReady instance=k8cluster1 \
--duration=2h --comment="planned maintenance"
# List active silences:
amtool silence query
# Expire one early:
amtool silence expire <silence-id>
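Matchers take regexes too, which is handy when a whole slice of the cluster is under maintenance (the namespace values here are just examples):
amtool silence add 'namespace=~"media|downloads"' --duration=4h --comment="media stack upgrade"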
Alertmanager routing recipe¶
The shape that has worked for me:
route:
  group_by: [namespace, alertname]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: discord-default
  routes:
    - matchers: [severity = "critical"]
      receiver: discord-ops
      repeat_interval: 30m
    - matchers: [namespace = "media"]
      receiver: discord-media
    - matchers: [alertname =~ "Falco.*"]
      receiver: discord-security
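Before deploying a change, amtool can answer "where does this alert land?" straight from the config file (paths and labels here are illustrative):
amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml \
  severity=critical namespace=media
# Prints the receiver the label set routes to (discord-ops with the tree above).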
The custom Discord proxy lives in monitoring/discord-alert-proxy and translates webhook payloads into channel-appropriate messages.
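On the Alertmanager side those receivers are plain webhook_configs pointed at the proxy; a sketch (the in-cluster address and path are assumptions):
receivers:
  - name: discord-ops
    webhook_configs:
      # Hypothetical service address; the path selects the Discord channel.
      - url: http://discord-alert-proxy.monitoring.svc:8080/ops
        send_resolved: true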
Textfile collector (export from a script)¶
When the data you want to alert on is not exposed by an existing exporter, write it to a file that node_exporter's textfile collector picks up. The lab uses this for: etcd DB size from the weekly maintenance run, 1Password rate-limit gauges, ansible-playbook-changed counts, and apt_upgrades_pending_total.
# Standard layout: one file per metric source, .prom extension.
TEXTFILE_DIR=/var/lib/node_exporter/textfiles
# Atomic write: create the temp file in the target directory so the final
# rename stays on one filesystem (rename is only atomic within a filesystem);
# write-and-rename avoids partial reads during scrapes.
TMPFILE=$(mktemp "$TEXTFILE_DIR/.my_metric.XXXXXX")
cat > "$TMPFILE" <<EOF
# HELP my_metric A description scraped by node_exporter.
# TYPE my_metric gauge
my_metric{label="foo"} 42
my_metric_last_run_timestamp_seconds $(date +%s)
EOF
chmod 0644 "$TMPFILE"
mv -f "$TMPFILE" "$TEXTFILE_DIR/my_metric.prom"
Things to remember:
- There is up to one scrape interval (~30 sec) of latency between writing the file and seeing the value in Prometheus. Export _last_run_timestamp_seconds so you can alert on stale collectors (see the sketch after this list).
- node_exporter must be running with --collector.textfile.directory=$TEXTFILE_DIR. Default on most distros.
- File ownership matters: the writer must be able to overwrite the file and node_exporter must be able to read it. The lab uses 0644 files in a 0755 directory.
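And the staleness alert that timestamp metric exists for, as a sketch (the two-hour threshold is arbitrary; pick something comfortably longer than the script's schedule):
# Script has not written a fresh file recently:
time() - my_metric_last_run_timestamp_seconds > 2 * 3600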
Things that are easy to get wrong¶
- for: 5m matters. A bare alert expression fires on the first evaluation that matches. Always set for: so transient blips do not page.
- Use absent() for "this should be reporting":
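A sketch, assuming the scrape job is literally named node-exporter (match the job label to the real scrape config):
absent(up{job="node-exporter"})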
This catches "the exporter itself is gone," which a up == 0 alert cannot, because if the target does not exist there is nothing to be 0.
- Cardinality is real. Labels with unbounded values (request URLs, user IDs, container UIDs) blow up Prometheus memory. Drop them at the scrape config or the relabel level.
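A sketch of dropping one at scrape time (job name, target, and label name are all illustrative):
scrape_configs:
  - job_name: example-app                          # hypothetical job with a noisy label
    static_configs:
      - targets: ["example-app.default.svc:8080"]  # placeholder target
    metric_relabel_configs:
      # Strip the unbounded label before ingestion. If that label was the only
      # thing distinguishing series, drop the whole metric (action: drop) instead.
      - action: labeldrop
        regex: request_uri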