Prometheus can be configured with rules to monitor timeseries of interest and generate alerts on them. To do this it needs to work with a companion alertmanager, which groups the alerts and performs the actual sending (via E-mail, Slack etc.).
This exercise is a very brief introduction to alerting and alertmanager.
                                                 email,
                                                 slack etc
prometheus -----------> alertmanager ----------> USERS
[alerting               [routing
 rules]                  rules]
Prometheus is configured to check for alert conditions using PromQL expressions. If an expression returns a non-empty result, the alert condition is posted to alertmanager, along with any extra labels and annotations you may have configured. Alertmanager then aggregates and routes the alerts, and performs the actual delivery.
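If you want to see what an alerting expression actually returns before you turn it into a rule, you can query the Prometheus HTTP API by hand. The following is just a sketch: it assumes Prometheus is listening on localhost:9090 and serving under the /prometheus path prefix used elsewhere in this workshop.

curl -s -G 'http://localhost:9090/prometheus/api/v1/query' \
     --data-urlencode 'query=up == 0'
# An empty "result" list means the alert condition currently matches nothing.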
Do this exercise on your campus server instance (srv1.campusX.ws.nsrc.org)
(If alertmanager is pre-installed, skip to the next section “Start alertmanager”)
Fetch and unpack the latest release from the releases page and create a symlink so that /opt/alertmanager
refers to the current version.
wget https://github.com/prometheus/alertmanager/releases/download/vX.Y.Z/alertmanager-X.Y.Z.linux-amd64.tar.gz
tar -C /opt -xvzf alertmanager-X.Y.Z.linux-amd64.tar.gz
ln -s alertmanager-X.Y.Z.linux-amd64 /opt/alertmanager
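Before going any further, you can check that the unpacked binary runs; the --version flag just prints version information and exits:

/opt/alertmanager/alertmanager --version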
Create a data directory:
mkdir /var/lib/alertmanager
chown prometheus:prometheus /var/lib/alertmanager
Use a text editor to create a systemd unit file /etc/systemd/system/alertmanager.service
with the following contents:
[Unit]
Description=Prometheus Alertmanager
Documentation=https://prometheus.io/docs/alerting/alertmanager/
After=network-online.target
[Service]
User=prometheus
Restart=on-failure
RestartSec=5
WorkingDirectory=/var/lib/alertmanager
EnvironmentFile=/etc/default/alertmanager
ExecStart=/opt/alertmanager/alertmanager $OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
[Install]
WantedBy=multi-user.target
Tell systemd to read this new file:
systemctl daemon-reload
Also create an options file /etc/default/alertmanager
with the following contents:
OPTIONS='--config.file=/etc/prometheus/alertmanager.yml --web.external-url=http://srv1.campusX.ws.nsrc.org/alertmanager'
(adjust campusX as appropriate)
Create the initial default configuration:
cp /opt/alertmanager/alertmanager.yml /etc/prometheus/
Let’s start alertmanager:
systemctl enable alertmanager # start on future boots
systemctl start alertmanager # start now
journalctl -eu alertmanager # check for "Listening address=:9093"
Use cursor keys to move around the journalctl log output, and “q” to quit. If there are any errors, then go back and fix them.
Test that the web interface is visible at http://oob.srv1.campusX.ws.nsrc.org/alertmanager (or go to virtual training platform web interface, select Web > srv1 under your campus, and click on Alertmanager)
In this workshop we’ve configured Apache so that the path /alertmanager is proxied to port 9093.
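For reference, the relevant Apache configuration is roughly the sketch below; this is an assumption about how the workshop image is set up, and you don’t need to change it. Because alertmanager was started with --web.external-url containing the /alertmanager path, the backend URL also includes that prefix.

# requires mod_proxy and mod_proxy_http to be enabled
ProxyPass        /alertmanager http://localhost:9093/alertmanager
ProxyPassReverse /alertmanager http://localhost:9093/alertmanager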
Prometheus itself has to know where to send alerts to, and also where to read its alerting rules.
Create a directory where we will put alerting rules:
mkdir /etc/prometheus/rules.d
Edit /etc/prometheus/prometheus.yml. Under alerting / alertmanagers, uncomment the target and change the section so that it looks like this:
# Alertmanager configuration
alerting:
  alertmanagers:
  - path_prefix: /alertmanager
    static_configs:
    - targets:
      - localhost:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/etc/prometheus/rules.d/*.yml"
Tell prometheus about its changed configuration:
systemctl reload prometheus
journalctl -eu prometheus
Check that the change was accepted without errors.
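If you want to confirm that prometheus has discovered the alertmanager, you can also ask its API; as before, this sketch assumes prometheus listens on localhost:9090 under the /prometheus prefix.

curl -s http://localhost:9090/prometheus/api/v1/alertmanagers
# the "activeAlertmanagers" list should contain an entry for localhost:9093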
Create file /etc/prometheus/rules.d/basic.yml
with the following contents:
groups:
- name: basic
  interval: 1m
  rules:
  - alert: UpDown
    expr: up == 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: 'Scrape failed: host is down or scrape endpoint down/unreachable'
  - alert: FilesystemReadOnly
    expr: node_filesystem_readonly > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: 'Filesystem is mounted read-only'
  - alert: DiskFull
    expr: node_filesystem_avail_bytes < 100000000 unless node_filesystem_size_bytes < 120000000
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: 'Filesystem full or less than 100MB free space'
(The final rule sends an alert if the filesystem has less than 100MB free, but is suppressed if the filesystem size is less than 120MB - otherwise, small filesystems would continuously alert)
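If the behaviour of unless is new to you: it keeps only those series on its left-hand side which have no matching series (same labels) on the right-hand side. As a further illustration, here is a sketch you can paste into the Prometheus expression browser (the 10% threshold is an arbitrary example):

# filesystems with under 10% space available, unless they are mounted read-only anyway
node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10
  unless node_filesystem_readonly > 0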
You should get into the habit of checking rules:
/opt/prometheus/promtool check config /etc/prometheus/prometheus.yml
Checking /etc/prometheus/prometheus.yml
SUCCESS: 1 rule files found
Checking /etc/prometheus/rules.d/basic.yml
SUCCESS: 3 rules found
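You can also check a rules file on its own, using promtool’s separate rules subcommand:

/opt/prometheus/promtool check rules /etc/prometheus/rules.d/basic.yml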
Now you can tell prometheus to pick them up:
systemctl reload prometheus
journalctl -eu prometheus
You can check this by going to the prometheus web interface (not the alertmanager web interface!) at http://oob.srv1.campusX.ws.nsrc.org/prometheus and select the “Alerts” tab: you should see you have three Inactive alerts. Click on “Inactive” and you can see the rules.
Stop your node_exporter:
systemctl stop node_exporter
Return to the alerts page. Within a minute you should see an alert go into “Pending” state, and within another minute it will go into “Firing” state.
Change the URL in your browser from /prometheus/alerts to /alertmanager to get to the alertmanager UI, and you should see the active alert.
Restart your node_exporter:
systemctl start node_exporter
Look again at the “UpDown” alert rule and the sequence of states you just observed: the rule fired in prometheus, and alertmanager received and displayed the alert. However, the default alertmanager configuration tries to post notifications to a webhook at localhost:5001, which we don’t have, so nothing is actually sent.
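You can see this by looking at the default /etc/prometheus/alertmanager.yml you copied earlier; the receiver section looks roughly like this (the exact contents vary between alertmanager releases):

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'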
To send E-mails, we need to adjust the alertmanager configuration. Edit the file /etc/prometheus/alertmanager.yml and adjust the settings to look like this:
global:
  resolve_timeout: 5m
  smtp_smarthost: 'localhost:25'
  smtp_from: 'prometheus@srv1.campusX.ws.nsrc.org'
  smtp_require_tls: false

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'NOC group'

receivers:
- name: 'NOC group'
  email_configs:
  - to: sysadm@localhost
    send_resolved: true
  - to: you@yourdomain.com        # optional: include your real E-mail address here
    send_resolved: true

inhibit_rules:
- source_matchers:
  - 'severity = "critical"'
  target_matchers:
  - 'severity = "warning"'
  equal: ['alertname', 'dev', 'instance']
Pick up changes:
systemctl reload alertmanager
journalctl -eu alertmanager
Stop your node exporter again. Watch the alert change status in prometheus and alertmanager over the next 2-3 minutes.
When it’s firing, check if a mail has been delivered:
tail -300 /var/spool/mail/sysadm
(It’s actually a rather verbose HTML E-mail, but it displays nicely in an HTML-aware mail reader.) You could also try logging into srv1 as the “sysadm” user and running “mutt”.
If there appear to be problems with mail delivery, check journalctl -eu alertmanager for logs.
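If alertmanager reports SMTP errors, it is also worth confirming that a local mail server is actually listening on port 25 (this assumes the workshop server runs a local MTA):

ss -ltn | grep ':25'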
Finally, restart your node_exporter:
systemctl restart node_exporter
After 5 minutes (resolve_timeout) you should receive a mail confirming that the problem is over.
Unlike nagios, prometheus has access to the history of every timeseries in its timeseries database. Therefore, given suitable PromQL queries, you can write alerts which make use of that history:

avg_over_time(foo[10m])   -- average value of foo over the last 10 minutes
foo offset 5m             -- foo as it was 5 minutes previously
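Other *_over_time functions work the same way, for example (these use metrics that node_exporter already provides; adjust to taste):

max_over_time(node_load1[1h])                   -- highest 1-minute load average seen in the last hour
min_over_time(node_filesystem_avail_bytes[1d])  -- lowest free space seen over the last day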
There are further functions which can be useful. Here is an example which uses the last 3 hours of history to fit a line by linear regression, in order to predict whether the filesystem will be full within 2 days if the current growth trend continues:
- name: DiskRate3h
  interval: 10m
  rules:
  # Warn if rate of growth over last 3 hours means filesystem will fill in 2 days
  - alert: DiskFilling3h
    expr: predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[3h], 2*86400) < 0
    for: 6h
    labels:
      severity: warning
    annotations:
      summary: 'Filesystem will be full in less than 2d at current 3h growth rate'
And here is a slightly more complex version which calculates how long until the filesystem will become full:
- name: DiskRate3h
  interval: 10m
  rules:
  # Warn if rate of growth over last 3 hours means filesystem will fill in 2 days
  - alert: DiskFilling3h
    expr: |
      node_filesystem_avail_bytes / (node_filesystem_avail_bytes -
        (predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[3h], 172800) < 0)) * 172800
    for: 6h
    labels:
      severity: warning
    annotations:
      summary: 'Filesystem will be full in {{ $value | humanizeDuration }} at current 3h growth rate'
The condition is evaluated every 10 minutes, but must remain true for 6 hours to reduce spurious alerts (e.g. disk usage grows for a few hours and then shrinks again). This means you’ll get at most 42 hours’ notice (the 48-hour prediction window minus the 6-hour hold time).
Such a rule avoids the need for static thresholds, such as 80% or 90% full. However, you should also have a rule that checks for completely-full filesystems, so you get notified of those immediately.
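The DiskFull rule you created in basic.yml already does this. If you prefer an explicit “completely full” check, a sketch could look like the following (the 5-minute hold time is an arbitrary example):

  - alert: DiskCompletelyFull
    expr: node_filesystem_avail_bytes == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: 'Filesystem has no space left for non-root users'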
Some filesystems reserve a certain percentage of space for root only. The metric node_filesystem_avail_bytes shows the amount of space available after subtracting the reserved area, and hence is smaller than node_filesystem_free_bytes. When node_filesystem_avail_bytes hits zero, users other than root can no longer create files, so the filesystem is “full” as far as they are concerned. Therefore, if you want to show the actual percentage of disk space used you would use node_filesystem_free_bytes; but if you want to check whether the disk is effectively “full”, you probably want node_filesystem_avail_bytes.
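For example (a sketch using only the metrics discussed above):

# actual percentage of disk space used on each filesystem
100 * (1 - node_filesystem_free_bytes / node_filesystem_size_bytes)

# percentage still available to non-root users (zero means "effectively full")
100 * node_filesystem_avail_bytes / node_filesystem_size_bytes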
Alertmanager itself exposes some of its own metrics. If you want, you can scrape them by adding a new scrape job to prometheus.yml (add this under the scrape_configs section, not under alerting):
  - job_name: alertmanager
    metrics_path: /alertmanager/metrics
    static_configs:
    - targets:
      - localhost:9093
The metrics include:
alertmanager_alerts{instance="localhost:9093",job="alertmanager",state="active"} 1
alertmanager_alerts{instance="localhost:9093",job="alertmanager",state="suppressed"} 0
...
alertmanager_notifications_total{instance="localhost:9093",integration="email",job="alertmanager"} 2
alertmanager_notifications_total{instance="localhost:9093",integration="hipchat",job="alertmanager"} 0
alertmanager_notifications_total{instance="localhost:9093",integration="opsgenie",job="alertmanager"} 0
...
alertmanager_notifications_failed_total{instance="localhost:9093",integration="email",job="alertmanager"} 0
alertmanager_notifications_failed_total{instance="localhost:9093",integration="hipchat",job="alertmanager"} 0
alertmanager_notifications_failed_total{instance="localhost:9093",integration="opsgenie",job="alertmanager"} 0
...
You can therefore monitor the current number of active alerts, and how many notifications you’ve sent or have failed to send.
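For example, once these metrics are being scraped you could add a rule like the following sketch to catch delivery problems (the rate window, hold time and severity are arbitrary examples):

  - alert: AlertmanagerNotificationsFailing
    expr: rate(alertmanager_notifications_failed_total[5m]) > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: 'Alertmanager is failing to deliver some notifications'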
karma provides a dashboard for alerting, and is particularly useful when you have multiple alertmanager instances spread across multiple campuses or data centres.
If you have time, you can try configuring karma to give a unified view of all the alertmanagers in the different campuses in the workshop.
(If karma is pre-installed, skip to the next section “Start karma”)
wget https://github.com/prymitive/karma/releases/download/vX.Y/karma-linux-amd64.tar.gz
mkdir /opt/karma
tar -C /opt/karma -xvzf karma-linux-amd64.tar.gz
Create /etc/systemd/system/karma.service with the following contents:
[Service]
User=prometheus
Restart=on-failure
RestartSec=5
WorkingDirectory=/var/lib/alertmanager
EnvironmentFile=/etc/default/karma
ExecStart=/opt/karma/karma-linux-amd64 $OPTIONS
[Install]
WantedBy=multi-user.target
Tell systemd to read this new file:
systemctl daemon-reload
Also create an options file /etc/default/karma
with the following contents:
OPTIONS='--config.file=/etc/prometheus/karma.yml'
Create /etc/prometheus/karma.yml
with the following contents (replace campusX with your own campus number):
alertmanager:
  interval: 60s
  servers:
  - name: campus1
    uri: http://srv1.campus1.ws.nsrc.org/alertmanager
    timeout: 10s
    proxy: true
  - name: campus2
    uri: http://srv1.campus2.ws.nsrc.org/alertmanager
    timeout: 10s
    proxy: true
  - name: campus3
    uri: http://srv1.campus3.ws.nsrc.org/alertmanager
    timeout: 10s
    proxy: true
  - name: campus4
    uri: http://srv1.campus4.ws.nsrc.org/alertmanager
    timeout: 10s
    proxy: true
  - name: campus5
    uri: http://srv1.campus5.ws.nsrc.org/alertmanager
    timeout: 10s
    proxy: true
  - name: campus6
    uri: http://srv1.campus6.ws.nsrc.org/alertmanager
    timeout: 10s
    proxy: true

karma:
  name: campusX-karma
  listen:
    address: "0.0.0.0"
    port: 8080
    prefix: /karma/
Start it and check for errors:
systemctl enable karma
systemctl start karma
journalctl -eu karma
The dashboard should be visible at http://oob.srv1.campusX.ws.nsrc.org/karma (as usual, for virtual training platform web interface, select Web > srv1 under your campus, then click Karma)
Alerts are grouped together into a pane, where all common labels are at the bottom of the pane and the individual alerts with their unique labels are above it. You can also create silences through karma (it forwards them onto alertmanager).
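Silences can also be created from the command line using amtool, which ships in the alertmanager tarball; here is a sketch (flag names may vary slightly between releases):

/opt/alertmanager/amtool --alertmanager.url=http://localhost:9093/alertmanager \
    silence add alertname=UpDown --comment="planned maintenance" --duration=2h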
An alternative to karma you can check out is alerta: https://alerta.io/