Prometheus can be configured with rules that monitor timeseries of interest and generate alerts from them. To do this it works with a companion alertmanager, which groups the alerts and performs the actual sending (via E-mail, Slack, etc.)

This exercise is a very brief introduction to alertmanager.

Do this on your campus server instance (srv1.campusX.ws.nsrc.org)

1 Install alertmanager

(If alertmanager is pre-installed, skip to the next section “Start alertmanager”)

Fetch and unpack the latest release from the releases page and create a symlink so that /opt/alertmanager refers to the current version.

wget https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz
tar -C /opt -xvzf alertmanager-0.21.0.linux-amd64.tar.gz
ln -s alertmanager-0.21.0.linux-amd64 /opt/alertmanager
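You can check that the symlink works and confirm which version is installed (the version string shown will match the release you downloaded):

```shell
/opt/alertmanager/alertmanager --version
```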

Create a data directory:

mkdir /var/lib/alertmanager
chown prometheus:prometheus /var/lib/alertmanager

Use a text editor to create a systemd unit file /etc/systemd/system/alertmanager.service with the following contents:

[Unit]
Description=Prometheus Alertmanager
Documentation=https://prometheus.io/docs/alerting/alertmanager/
After=network-online.target

[Service]
User=prometheus
Restart=on-failure
RestartSec=5
WorkingDirectory=/var/lib/alertmanager
EnvironmentFile=/etc/default/alertmanager
ExecStart=/opt/alertmanager/alertmanager $OPTIONS
ExecReload=/bin/kill -HUP $MAINPID

[Install]
WantedBy=multi-user.target

Tell systemd to read this new file:

systemctl daemon-reload

Also create an options file /etc/default/alertmanager with the following contents:

OPTIONS='--config.file=/etc/prometheus/alertmanager.yml --web.external-url=http://srv1.campusX.ws.nsrc.org/alertmanager'

(adjust campusX as appropriate)

Create the initial default configuration:

cp /opt/alertmanager/alertmanager.yml /etc/prometheus/
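The shipped default configuration looks approximately like this (the exact file in your release may differ slightly); note that it routes all alerts to a single webhook receiver, which we will replace with E-mail later in this exercise:

```yaml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
```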

2 Start alertmanager

Let’s start alertmanager:

systemctl enable alertmanager  # start on future boots
systemctl start alertmanager   # start now
journalctl -eu alertmanager    # check for "Listening address=:9093"

Use cursor keys to move around the journalctl log output, and “q” to quit. If there are any errors, then go back and fix them.

Test that the web interface is visible at http://oob.srv1.campusX.ws.nsrc.org/alertmanager

In this workshop we’ve configured Apache so that the path /alertmanager is proxied to port 9093.

3 Configure prometheus

Prometheus itself has to know where to send alerts to, and also where to read its alerting rules.

Create a directory where we will put alerting rules:

mkdir /etc/prometheus/rules.d

Edit /etc/prometheus/prometheus.yml. Under alertmanagers/targets uncomment the target and change it so that it looks like this:

# Alertmanager configuration
alerting:
  alertmanagers:
    - path_prefix: /alertmanager
      static_configs:
        - targets: ['localhost:9093']

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/etc/prometheus/rules.d/*.yml"

Tell prometheus about its changed configuration:

killall -HUP prometheus
journalctl -eu prometheus

Check that the change was accepted without errors.
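You can also confirm that prometheus has registered the alertmanager by querying its HTTP API (the URL below assumes the /prometheus path prefix used in this workshop); the response should list localhost:9093 under activeAlertmanagers:

```shell
curl -s http://localhost:9090/prometheus/api/v1/alertmanagers
```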

4 Create some alerting rules

Create file /etc/prometheus/rules.d/basic.yml with the following contents:

groups:
  - name: basic
    interval: 1m
    rules:
      - alert: UpDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: 'Scrape failed: host is down or scrape endpoint down/unreachable'

      - alert: FilesystemReadOnly
        expr: node_filesystem_readonly > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: 'Filesystem is mounted read-only'

      - alert: DiskFull
        expr: node_filesystem_avail_bytes < 100000000 and node_filesystem_size_bytes > 100000000
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: 'Filesystem full or less than 100MB free space'

You should get into the habit of checking rules:

/opt/prometheus/promtool check config /etc/prometheus/prometheus.yml
Checking /etc/prometheus/prometheus.yml
  SUCCESS: 1 rule files found

Checking /etc/prometheus/rules.d/basic.yml
  SUCCESS: 3 rules found
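promtool can also check an individual rules file on its own, which is handy while editing:

```shell
/opt/prometheus/promtool check rules /etc/prometheus/rules.d/basic.yml
```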

Now you can tell prometheus to pick them up:

killall -HUP prometheus
journalctl -eu prometheus

If you go to the prometheus web interface at http://oob.srv1.campusX.ws.nsrc.org/prometheus and select the “Alerts” tab, you should see you have three Inactive alerts. Click on “Inactive” and you can see the rules.

5 Testing alerting

Stop your node_exporter:

systemctl stop node_exporter

Return to the alerts page. Within a minute or so you should see the UpDown alert go into “Pending” state, and once it has been pending for the two minutes given by its “for: 2m” clause it will go into “Firing” state.

Change the URL in your browser from /prometheus/alerts to /alertmanager to get to the alertmanager UI, and you should see the active alert.

Restart your node_exporter:

systemctl start node_exporter

5.1 How it works

Look again at the configuration for the “UpDown” alert rule: the expression “up == 0” became true when node_exporter stopped, prometheus held the alert in “Pending” for the 2 minutes given by “for: 2m”, then moved it to “Firing” and pushed it to alertmanager.

However, the default alertmanager configuration tries to post to a webhook at localhost:5001, which we don’t have, so nothing is actually sent.
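You can inspect the alerts alertmanager is currently holding with amtool, which ships in the same tarball (the --alertmanager.url reflects the /alertmanager prefix configured earlier):

```shell
/opt/alertmanager/amtool --alertmanager.url=http://localhost:9093/alertmanager alert
```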

6 Sending E-mails

To send E-mails, we need to adjust the alertmanager configuration:

global:
  resolve_timeout: 5m
  smtp_smarthost: 'localhost:25'
  smtp_from: 'prometheus@srv1.campusX.ws.nsrc.org'
  smtp_require_tls: false

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'NOC group'
receivers:
  - name: 'NOC group'
    email_configs:
      - to: sysadm@localhost
        send_resolved: true
      - to: you@yourdomain.com    # optional: include your real E-mail address here
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
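Before reloading, it is worth validating the edited file with amtool (also shipped alongside alertmanager):

```shell
/opt/alertmanager/amtool check-config /etc/prometheus/alertmanager.yml
```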

Pick up changes:

killall -HUP alertmanager
journalctl -eu alertmanager

Stop your node exporter again. Watch the alert change status in prometheus and alertmanager over the next 2-3 minutes.

When it’s firing, check if a mail has been delivered:

tail -300 /var/spool/mail/sysadm

(It’s actually a rather verbose HTML E-mail, but it displays prettily in an HTML-aware mail reader)

If there appear to be problems with mail delivery, check journalctl -eu alertmanager for logs.

Finally, restart your node_exporter:

systemctl restart node_exporter

After 5 minutes (resolve_timeout) you should receive a mail confirming that the problem is over.

7 More advanced rules

Unlike nagios, prometheus has access to the history of every timeseries in the timeseries database. Therefore, given suitable PromQL queries, you can write alerts which make use of that history:

avg_over_time(foo[10m])   -- average value of foo over the last 10 minutes

foo offset 5m             -- the value foo had 5 minutes previously
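You can experiment with such expressions in the prometheus web interface, or from the command line via the HTTP query API (the URL below assumes this workshop’s /prometheus prefix):

```shell
curl -s 'http://localhost:9090/prometheus/api/v1/query' \
  --data-urlencode 'query=avg_over_time(up[10m])'
```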

There are further functions which can be useful. Here is an example, which uses the last 3 hours history to fit a line using linear regression, to predict whether the filesystem will be full within 2 days (if it continues the current growth trend).

  - name: DiskRate3h
    interval: 10m
    rules:
      # Warn if rate of growth over last 3 hours means filesystem will fill in 2 days
      - alert: DiskFilling
        expr: predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[3h], 2*86400) < 0
        for: 6h
        labels:
          severity: warning
        annotations:
          summary: 'Filesystem will be full in less than 2d at current 3h growth rate'

The condition is evaluated every 10 minutes, but must remain true for 6 hours to reduce spurious alerts (e.g. disk usage grows for a few hours and then shrinks again).
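To see what predict_linear is doing, here is a minimal sketch: a least-squares line fit over (time, value) samples, extrapolated into the future. This is an illustration of the idea, not prometheus’s actual implementation (which extrapolates from the evaluation time rather than the last sample):

```python
# Least-squares linear fit over (t, v) samples, then extrapolate
# t_ahead seconds beyond the last sample -- roughly what
# predict_linear() does.
def predict_linear(samples, t_ahead):
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var                    # bytes per second
    intercept = mean_v - slope * mean_t
    t_end = samples[-1][0]
    return slope * (t_end + t_ahead) + intercept

# Free space shrinking by 1 MB every 60 s: three samples over 2 minutes.
samples = [(0, 500e6), (60, 499e6), (120, 498e6)]
print(predict_linear(samples, 2 * 86400))  # negative => disk will fill
```

A negative predicted value means the filesystem is forecast to run out of space within the window, which is exactly the condition the DiskFilling rule alerts on.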

Such a rule avoids the need for static thresholds, such as 80% or 90% full. However, you should also have a rule that checks for completely-full filesystems, so you get notified of those immediately.

7.1 Note on disk space metrics

Some filesystems reserve a certain percentage of space for root only. The metric node_filesystem_avail_bytes shows the amount of space available after subtracting the reserved area, and hence is smaller than node_filesystem_free_bytes. When node_filesystem_avail_bytes hits zero, users other than root can no longer create files, so the filesystem is “full” as far as they are concerned.

Therefore if you want to show the actual percentage of disk space used then you would use node_filesystem_free_bytes; but if you want to check if the disk is effectively “full” then you probably want node_filesystem_avail_bytes.
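As an illustration (these expressions are not part of the exercise), the two views look like this:

```
100 * (1 - node_filesystem_free_bytes / node_filesystem_size_bytes)   -- percent of disk actually used

node_filesystem_avail_bytes == 0                                      -- "full" as far as non-root users are concerned
```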

8 Optional extra: alertmanager metrics

Alertmanager itself exposes some of its own metrics. If you want, you can scrape them, by adding a new scrape job to prometheus.yml:

  - job_name: alertmanager
    metrics_path: /alertmanager/metrics
    static_configs:
      - targets: ['localhost:9093']

The metrics include:

alertmanager_alerts{instance="localhost:9093",job="alertmanager",state="suppressed"}                    0
alertmanager_alerts{instance="localhost:9093",job="alertmanager",state="active"}                        1

alertmanager_notifications_total{instance="localhost:9093",integration="email",job="alertmanager"}      2
alertmanager_notifications_total{instance="localhost:9093",integration="hipchat",job="alertmanager"}    0
alertmanager_notifications_total{instance="localhost:9093",integration="opsgenie",job="alertmanager"}   0
...
alertmanager_notifications_failed_total{instance="localhost:9093",integration="email",job="alertmanager"}    0
alertmanager_notifications_failed_total{instance="localhost:9093",integration="hipchat",job="alertmanager"}  0
alertmanager_notifications_failed_total{instance="localhost:9093",integration="opsgenie",job="alertmanager"} 0
...

You can therefore monitor the current number of active alerts, and how many notifications you’ve sent or have failed to send.
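You can also fetch the metrics endpoint directly to see the raw values (the path matches the metrics_path in the scrape job above):

```shell
curl -s http://localhost:9093/alertmanager/metrics | grep '^alertmanager_'
```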

9 Optional extra: karma dashboard

karma provides a dashboard for alerting, and is particularly useful when you have multiple alertmanager instances spread across multiple campuses or data centres.

If you have time, you can try configuring karma to give a unified view of all the alertmanagers in the different campuses in the workshop.

9.1 Install karma

(If karma is pre-installed, skip to the next section “Start karma”)

wget https://github.com/prymitive/karma/releases/download/v0.71/karma-linux-amd64.tar.gz
mkdir /opt/karma
tar -C /opt/karma -xvzf karma-linux-amd64.tar.gz

Create a systemd unit file /etc/systemd/system/karma.service with the following contents:

[Unit]
Description=Karma alert dashboard
After=network-online.target

[Service]
User=prometheus
Restart=on-failure
RestartSec=5
WorkingDirectory=/var/lib/alertmanager
EnvironmentFile=/etc/default/karma
ExecStart=/opt/karma/karma-linux-amd64 $OPTIONS

[Install]
WantedBy=multi-user.target

Tell systemd to read this new file:

systemctl daemon-reload

Also create an options file /etc/default/karma with the following contents:

OPTIONS='--config.file=/etc/prometheus/karma.yml'

9.2 Start karma

Create /etc/prometheus/karma.yml with the following contents:

alertmanager:
  interval: 60s
  servers:
    - name: campus1
      uri: http://srv1.campus1.ws.nsrc.org/alertmanager
      timeout: 10s
      proxy: true
    - name: campus2
      uri: http://srv1.campus2.ws.nsrc.org/alertmanager
      timeout: 10s
      proxy: true
    - name: campus3
      uri: http://srv1.campus3.ws.nsrc.org/alertmanager
      timeout: 10s
      proxy: true
    - name: campus4
      uri: http://srv1.campus4.ws.nsrc.org/alertmanager
      timeout: 10s
      proxy: true
    - name: campus5
      uri: http://srv1.campus5.ws.nsrc.org/alertmanager
      timeout: 10s
      proxy: true
    - name: campus6
      uri: http://srv1.campus6.ws.nsrc.org/alertmanager
      timeout: 10s
      proxy: true
karma:
  name: campusX-karma
listen:
  address: "0.0.0.0"
  port: 8080
  prefix: /karma/
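If your karma version supports the --check-config flag, you can validate this file before starting the service:

```shell
/opt/karma/karma-linux-amd64 --config.file=/etc/prometheus/karma.yml --check-config
```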

Start it and check for errors:

systemctl enable karma
systemctl start karma
journalctl -eu karma

The dashboard should be visible at http://oob.srv1.campusX.ws.nsrc.org/karma

Alerts are grouped together into a pane, with the labels common to the whole group at the bottom of the pane and the individual alerts with their unique labels above it. You can also create silences through karma (it forwards them on to alertmanager).

9.3 alerta.io

An alternative to karma you can check out is alerta: https://alerta.io/

10 Further reading