Prometheus can be configured with rules to monitor timeseries of interest and generate alerts on them. To do this it needs to work with a companion alertmanager, which groups the alerts and performs the actual sending (via E-mail, Slack, etc.)

This exercise is a very brief introduction to alerting and alertmanager.

                                               email,
                                              slack etc
        prometheus -----------> alertmanager ----------> USERS
        [alerting                [routing
          rules]                   rules]

Prometheus is configured to check for alert conditions using PromQL expressions. If an expression returns a non-empty result, then the alert condition is posted to alertmanager, along with any extra labels and annotations you may have configured. Alertmanager then performs alert aggregation and routing, and performs the actual delivery of alerts.
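
For example, the expression up == 0 returns one element for each target whose last scrape failed, and each element becomes its own alert instance carrying that element's labels (job, instance and so on). You can paste such an expression into the Prometheus expression browser to see what a rule would match; the sample output below is only illustrative, the label values will differ on your server:

up == 0

up{instance="localhost:9100", job="node"}    0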

Do this exercise on your campus server instance (srv1.campusX.ws.nsrc.org)

Install alertmanager

(If alertmanager is pre-installed, skip to the next section “Start alertmanager”)

Fetch and unpack the latest release from the releases page and create a symlink so that /opt/alertmanager refers to the current version.

wget https://github.com/prometheus/alertmanager/releases/download/vX.Y.Z/alertmanager-X.Y.Z.linux-amd64.tar.gz
tar -C /opt -xvzf alertmanager-X.Y.Z.linux-amd64.tar.gz
ln -s alertmanager-X.Y.Z.linux-amd64 /opt/alertmanager
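
Optionally, confirm that the binary runs and that the symlink points at it by asking it for its version:

/opt/alertmanager/alertmanager --version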

Create a data directory:

mkdir /var/lib/alertmanager
chown prometheus:prometheus /var/lib/alertmanager

Use a text editor to create a systemd unit file /etc/systemd/system/alertmanager.service with the following contents:

[Unit]
Description=Prometheus Alertmanager
Documentation=https://prometheus.io/docs/alerting/alertmanager/
After=network-online.target

[Service]
User=prometheus
Restart=on-failure
RestartSec=5
WorkingDirectory=/var/lib/alertmanager
EnvironmentFile=/etc/default/alertmanager
ExecStart=/opt/alertmanager/alertmanager $OPTIONS
ExecReload=/bin/kill -HUP $MAINPID

[Install]
WantedBy=multi-user.target

Tell systemd to read this new file:

systemctl daemon-reload

Also create an options file /etc/default/alertmanager with the following contents:

OPTIONS='--config.file=/etc/prometheus/alertmanager.yml --web.external-url=http://srv1.campusX.ws.nsrc.org/alertmanager'

(adjust campusX as appropriate)

Create the initial default configuration:

cp /opt/alertmanager/alertmanager.yml /etc/prometheus/
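
The release tarball also includes amtool, which can validate an alertmanager configuration before you start the daemon. A quick sanity check (paths assume the layout above):

/opt/alertmanager/amtool check-config /etc/prometheus/alertmanager.yml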

Start alertmanager

Let’s start alertmanager:

systemctl enable alertmanager  # start on future boots
systemctl start alertmanager   # start now
journalctl -eu alertmanager    # check for "Listening address=:9093"

Use cursor keys to move around the journalctl log output, and “q” to quit. If there are any errors, then go back and fix them.

Test that the web interface is visible at http://oob.srv1.campusX.ws.nsrc.org/alertmanager (or go to the virtual training platform web interface, select Web > srv1 under your campus, and click on Alertmanager)

In this workshop we’ve configured Apache so that the path /alertmanager is proxied to port 9093.

Configure prometheus

Prometheus itself has to know where to send alerts to, and also where to read its alerting rules.

Create a directory where we will put alerting rules:

mkdir /etc/prometheus/rules.d

Edit /etc/prometheus/prometheus.yml. Under alertmanagers/targets uncomment the target and change it so that it looks like this:

# Alertmanager configuration
alerting:
  alertmanagers:
    - path_prefix: /alertmanager
      static_configs:
        - targets:
            - localhost:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/etc/prometheus/rules.d/*.yml"

Tell prometheus about its changed configuration:

systemctl reload prometheus
journalctl -eu prometheus

Check that the change was accepted without errors.

Create some alerting rules

Create file /etc/prometheus/rules.d/basic.yml with the following contents:

groups:
  - name: basic
    interval: 1m
    rules:
      - alert: UpDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: 'Scrape failed: host is down or scrape endpoint down/unreachable'

      - alert: FilesystemReadOnly
        expr: node_filesystem_readonly > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: 'Filesystem is mounted read-only'

      - alert: DiskFull
        expr: node_filesystem_avail_bytes < 100000000 unless node_filesystem_size_bytes < 120000000
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: 'Filesystem full or less than 100MB free space'

(The final rule alerts if a filesystem has less than 100MB of free space, but the alert is suppressed for filesystems smaller than 120MB; otherwise, small filesystems would alert continuously)
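
Annotations can also use templates to pull in the labels and value of the firing series. As a purely illustrative sketch, you could make the DiskFull summary more specific like this ($labels and the humanize1024 filter are standard Prometheus template features; the wording is just an example):

        annotations:
          summary: 'Only {{ $value | humanize1024 }}B free on {{ $labels.mountpoint }} at {{ $labels.instance }}'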

You should get into the habit of checking rules:

/opt/prometheus/promtool check config /etc/prometheus/prometheus.yml
Checking /etc/prometheus/prometheus.yml
  SUCCESS: 1 rule files found

Checking /etc/prometheus/rules.d/basic.yml
  SUCCESS: 3 rules found
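
You can also check a single rules file on its own, which is handy when you are editing just one file:

/opt/prometheus/promtool check rules /etc/prometheus/rules.d/basic.yml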

Now you can tell prometheus to pick them up:

systemctl reload prometheus
journalctl -eu prometheus

You can check this by going to the prometheus web interface (not the alertmanager web interface!) at http://oob.srv1.campusX.ws.nsrc.org/prometheus and selecting the “Alerts” tab: you should see three Inactive alerts. Click on “Inactive” to see the rules.

Testing alerting

Stop your node_exporter:

systemctl stop node_exporter

Return to the alerts page. Within a minute you should see an alert go into “Pending” state, and within another minute it will go into “Firing” state.

Change the URL in your browser from /prometheus/alerts to /alertmanager to get to the alertmanager UI, and you should see the active alert.
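
If you prefer the command line, amtool can query alertmanager for the alerts it currently holds (the URL assumes the /alertmanager external URL prefix configured earlier):

/opt/alertmanager/amtool alert query --alertmanager.url=http://localhost:9093/alertmanager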

Restart your node_exporter:

systemctl start node_exporter

How it works

Look again at the configuration for the “UpDown” alert rule. Prometheus evaluates the expression up == 0 once a minute (the group’s interval); if any series match continuously for 2 minutes (the “for” clause), the alert fires and is posted to alertmanager, which groups it and routes it to a receiver according to its own configuration.

However, the default alertmanager configuration routes everything to a webhook at localhost:5001, which we don’t have, so nothing is actually sent.
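
You can see this for yourself: open /etc/prometheus/alertmanager.yml (the default file you copied earlier) and you will find something like the following (trimmed; exact values vary between alertmanager versions):

route:
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'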

Sending E-mails

To send E-mails, we need to adjust the alertmanager configuration. Edit the file /etc/prometheus/alertmanager.yml and adjust the settings to look like this:

global:
  resolve_timeout: 5m
  smtp_smarthost: 'localhost:25'
  smtp_from: 'prometheus@srv1.campusX.ws.nsrc.org'
  smtp_require_tls: false

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'NOC group'

receivers:
  - name: 'NOC group'
    email_configs:
      - to: sysadm@localhost
        send_resolved: true
      - to: you@yourdomain.com    # optional: include your real E-mail address here
        send_resolved: true

inhibit_rules:
  - source_matchers:
      - 'severity = "critical"'
    target_matchers:
      - 'severity = "warning"'
    equal: ['alertname', 'dev', 'instance']
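
To check how a particular alert would be routed, amtool can walk the routing tree against a set of example labels (the labels here are just for illustration):

/opt/alertmanager/amtool config routes test --config.file=/etc/prometheus/alertmanager.yml alertname=UpDown severity=critical

It should print the name of the matching receiver, i.e. NOC group.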

Pick up changes:

systemctl reload alertmanager
journalctl -eu alertmanager

Stop your node exporter again. Watch the alert change status in prometheus and alertmanager over the next 2-3 minutes.

When it’s firing, check if a mail has been delivered:

tail -300 /var/spool/mail/sysadm

(It’s actually a rather verbose HTML E-mail, but it displays nicely when viewed in an HTML-aware mail reader.) You could also try logging into srv1 as the “sysadm” user and running “mutt”.

If there appear to be problems with mail delivery, check journalctl -eu alertmanager for logs.

Finally, restart your node_exporter:

systemctl restart node_exporter

After 5 minutes (resolve_timeout) you should receive a mail confirming that the problem is over.

More advanced rules

Unlike Nagios, Prometheus has access to the history of every timeseries in the timeseries database. Therefore, given suitable PromQL queries, you can write alerts that make use of that history:

avg_over_time(foo[10m])     -- average value of foo over the last 10 minutes

foo offset 5m     -- foo as it was 5 minutes previously

There are further functions which can be useful. Here is an example which uses the last 3 hours of history to fit a line (linear regression) and predict whether the filesystem will become full within 2 days, if it continues growing at its current rate.

  - name: DiskRate3h
    interval: 10m
    rules:
      # Warn if rate of growth over last 3 hours means filesystem will fill in 2 days
      - alert: DiskFilling3h
        expr: predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[3h], 2*86400) < 0
        for: 6h
        labels:
          severity: warning
        annotations:
          summary: 'Filesystem will be full in less than 2d at current 3h growth rate'

And here is a slightly more complex version which calculates how long until the filesystem will become full:

  - name: DiskRate3h
    interval: 10m
    rules:
      # Warn if rate of growth over last 3 hours means filesystem will fill in 2 days
      - alert: DiskFilling3h
        expr: |
          node_filesystem_avail_bytes / (node_filesystem_avail_bytes -
          (predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[3h], 172800) < 0)) * 172800
        for: 6h
        labels:
          severity: warning
        annotations:
          summary: 'Filesystem will be full in {{ $value | humanizeDuration }} at current 3h growth rate'
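
To see why this gives a time in seconds, call A the available bytes now, and P the value predict_linear expects 172800 seconds (2 days) from now. The comparison "< 0" keeps only filesystems where P is negative, i.e. those predicted to fill within 2 days. For those:

A - P                      -- bytes expected to be consumed over the next 2 days
(A - P) / 172800           -- bytes consumed per second
A / ((A - P) / 172800)     -- seconds until A reaches zero
  = A / (A - P) * 172800   -- which is what the expression computes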

The condition is evaluated every 10 minutes, but must remain true for 6 hours to reduce spurious alerts (e.g. disk usage grows for a few hours and then shrinks again). This means you’ll get at most about 42 hours’ notice: 2 days minus the 6 hours the condition must hold.

Such a rule avoids the need for static thresholds, such as 80% or 90% full. However, you should also have a rule that checks for completely-full filesystems, so you get notified of those immediately.

Note on disk space metrics

Some filesystems reserve a certain percentage of space for root only. The metric node_filesystem_avail_bytes shows the amount of space available after subtracting the reserved area, and hence is smaller than node_filesystem_free_bytes. When node_filesystem_avail_bytes hits zero, users other than root can no longer create files, so the filesystem is “full” as far as they are concerned.

Therefore if you want to show the actual percentage of disk space used then you would use node_filesystem_free_bytes; but if you want to check if the disk is effectively “full” then you probably want node_filesystem_avail_bytes.
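
For example, these expressions (shown only as a sketch) give the percentage of space used, and the filesystems that are effectively full for ordinary users:

100 * (1 - node_filesystem_free_bytes / node_filesystem_size_bytes)   -- percentage of disk space used

node_filesystem_avail_bytes == 0                                      -- filesystems that are effectively "full"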

Optional extra: alertmanager metrics

Alertmanager exposes some metrics about itself. If you want, you can scrape them by adding a new scrape job to prometheus.yml (add this under the scrape_configs section, not under alerting):

  - job_name: alertmanager
    metrics_path: /alertmanager/metrics
    static_configs:
      - targets:
          - localhost:9093

The metrics include:

alertmanager_alerts{instance="localhost:9093",job="alertmanager",state="active"}                        1
alertmanager_alerts{instance="localhost:9093",job="alertmanager",state="suppressed"}                    0
...
alertmanager_notifications_total{instance="localhost:9093",integration="email",job="alertmanager"}      2
alertmanager_notifications_total{instance="localhost:9093",integration="hipchat",job="alertmanager"}    0
alertmanager_notifications_total{instance="localhost:9093",integration="opsgenie",job="alertmanager"}   0
...
alertmanager_notifications_failed_total{instance="localhost:9093",integration="email",job="alertmanager"}    0
alertmanager_notifications_failed_total{instance="localhost:9093",integration="hipchat",job="alertmanager"}  0
alertmanager_notifications_failed_total{instance="localhost:9093",integration="opsgenie",job="alertmanager"} 0
...

You can therefore monitor the current number of active alerts, and how many notifications you’ve sent or have failed to send.
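
You could even alert on alertmanager failing to deliver notifications. Here is a minimal sketch you could drop into /etc/prometheus/rules.d/ (the threshold, labels and wording are just suggestions, not part of the standard setup):

groups:
  - name: alertmanager
    rules:
      - alert: AlertmanagerNotificationsFailing
        expr: increase(alertmanager_notifications_failed_total[1h]) > 0
        labels:
          severity: warning
        annotations:
          summary: 'Alertmanager failed to deliver one or more notifications in the last hour'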

Optional extra: karma dashboard

karma provides a dashboard for alerting, and is particularly useful when you have multiple alertmanager instances spread across multiple campuses or data centres.

If you have time, you can try configuring karma to give a unified view of all the alertmanagers in the different campuses in the workshop.

Install karma

(If karma is pre-installed, skip to the next section “Start karma”)

wget https://github.com/prymitive/karma/releases/download/vX.Y/karma-linux-amd64.tar.gz
mkdir /opt/karma
tar -C /opt/karma -xvzf karma-linux-amd64.tar.gz

Create /etc/systemd/system/karma.service with the following contents:

[Unit]
Description=Karma alertmanager dashboard
After=network-online.target

[Service]
User=prometheus
Restart=on-failure
RestartSec=5
WorkingDirectory=/var/lib/alertmanager
EnvironmentFile=/etc/default/karma
ExecStart=/opt/karma/karma-linux-amd64 $OPTIONS

[Install]
WantedBy=multi-user.target

Tell systemd to read this new file:

systemctl daemon-reload

Also create an options file /etc/default/karma with the following contents:

OPTIONS='--config.file=/etc/prometheus/karma.yml'

Start karma

Create /etc/prometheus/karma.yml with the following contents (replace campusX with your own campus number):

alertmanager:
  interval: 60s
  servers:
    - name: campus1
      uri: http://srv1.campus1.ws.nsrc.org/alertmanager
      timeout: 10s
      proxy: true
    - name: campus2
      uri: http://srv1.campus2.ws.nsrc.org/alertmanager
      timeout: 10s
      proxy: true
    - name: campus3
      uri: http://srv1.campus3.ws.nsrc.org/alertmanager
      timeout: 10s
      proxy: true
    - name: campus4
      uri: http://srv1.campus4.ws.nsrc.org/alertmanager
      timeout: 10s
      proxy: true
    - name: campus5
      uri: http://srv1.campus5.ws.nsrc.org/alertmanager
      timeout: 10s
      proxy: true
    - name: campus6
      uri: http://srv1.campus6.ws.nsrc.org/alertmanager
      timeout: 10s
      proxy: true
karma:
  name: campusX-karma
listen:
  address: "0.0.0.0"
  port: 8080
  prefix: /karma/

Start it and check for errors:

systemctl enable karma
systemctl start karma
journalctl -eu karma

The dashboard should be visible at http://oob.srv1.campusX.ws.nsrc.org/karma (as usual, on the virtual training platform web interface, select Web > srv1 under your campus, then click Karma)

Alerts are grouped together into a pane: labels common to the whole group are shown at the bottom of the pane, and the individual alerts with their unique labels are listed above it. You can also create silences through karma (it forwards them on to alertmanager).
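
Silences can also be created from the command line with amtool, if you prefer. For example (the matcher, duration and comment here are only an illustration, and the URL assumes the local alertmanager prefix used earlier):

/opt/alertmanager/amtool silence add alertname=UpDown --duration=2h --comment='planned maintenance' --alertmanager.url=http://localhost:9093/alertmanager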

alerta.io

An alternative to karma you can check out is alerta: https://alerta.io/

Further reading