Prometheus can perform Nagios-like “blackbox” service testing - such as pings, http queries and DNS checks - using the blackbox_exporter.

The setup is somewhat like snmp_exporter. You run the blackbox_exporter at the place where the checks should originate - this can be on your prometheus server (although it doesn’t have to be). You then scrape blackbox_exporter from prometheus, and each scrape triggers a check.

Do this exercise on your campus srv1 server.

Install blackbox_exporter

(If blackbox_exporter is pre-installed, skip to the next section “Start blackbox_exporter”)

Fetch and unpack the latest release from the releases page and create a symlink so that /opt/blackbox_exporter refers to the current version.

wget https://github.com/prometheus/blackbox_exporter/releases/download/vX.Y.Z/blackbox_exporter-X.Y.Z.linux-amd64.tar.gz
tar -C /opt -xvzf blackbox_exporter-X.Y.Z.linux-amd64.tar.gz
ln -s blackbox_exporter-X.Y.Z.linux-amd64 /opt/blackbox_exporter

Use a text editor to create a systemd unit file /etc/systemd/system/blackbox_exporter.service with the following contents:

[Unit]
Description=Prometheus Blackbox Exporter
Documentation=https://github.com/prometheus/blackbox_exporter
After=network-online.target

[Service]
User=prometheus
Restart=on-failure
RestartSec=5
EnvironmentFile=/etc/default/blackbox_exporter
AmbientCapabilities=CAP_NET_RAW
ExecStart=/opt/blackbox_exporter/blackbox_exporter $OPTIONS
ExecReload=/bin/kill -HUP $MAINPID

[Install]
WantedBy=multi-user.target

Tell systemd to read this new file:

systemctl daemon-reload

Also create an options file /etc/default/blackbox_exporter with the following contents:

OPTIONS='--web.listen-address=127.0.0.1:9115 --config.file=/etc/prometheus/blackbox.yml'

Start blackbox_exporter

There is a sample configuration in /opt/blackbox_exporter/blackbox.yml but we will create one from scratch.

Create /etc/prometheus/blackbox.yml with the following contents:

modules:
  ping4:
    prober: icmp
    timeout: 3s
    icmp:
      preferred_ip_protocol: ip4
      ip_protocol_fallback: false

Now start blackbox_exporter:

systemctl enable blackbox_exporter  # start on future boots
systemctl start blackbox_exporter   # start now
journalctl -eu blackbox_exporter    # check for msg="Listening on address" address=127.0.0.1:9115

Use cursor keys to move around the journalctl log output, and “q” to quit. If there are any errors, then go back and fix them.

Manual test

Test using curl. You need to provide two arguments: the name of a module you have defined in blackbox.yml, and the target (the hostname or address to test). You will need to quote the URL because it contains the & character, which is special to the shell.

curl 'localhost:9115/probe?target=nsrc.org&module=ping4'

The response should include probe_success 1 if the remote host responded, plus metrics for the time taken for the DNS query and the time taken for the overall probe.
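
The output will look something like this (the values here are only illustrative, and the exact set of metrics depends on your blackbox_exporter version):

# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.012
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.21
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1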

Unlike Nagios, this sends only a single ping. Therefore, if that one packet is lost, probe_success will be 0. We’ll see how to deal with this shortly.

NOTE: if you have a problem, you can get detailed logs from the probe in the HTTP response by adding &debug=true to the URL:

curl 'localhost:9115/probe?target=nsrc.org&module=ping4&debug=true'

This will usually make it clear at what step the probe encountered a problem.

Configure prometheus

We now need to configure prometheus to scrape the blackbox_exporter.

First, create a targets file /etc/prometheus/targets.d/blackbox.yml containing the following (remembering to replace campusX as required):

- labels:
    module: ping4
  targets:
    - 1.2.3.4
    - nsrc.org
    - gw.ws.nsrc.org
    - bdr1.campusX.ws.nsrc.org
    - core1.campusX.ws.nsrc.org

(1.2.3.4 is an address which we know is not going to respond)

Edit /etc/prometheus/prometheus.yml and add the following to the bottom of the scrape_configs: section:

  - job_name: 'blackbox'
    file_sd_configs:
      - files:
         - /etc/prometheus/targets.d/blackbox.yml
    metrics_path: /probe
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [module]
        target_label: __param_module
      - target_label: __address__
        replacement: 127.0.0.1:9115  # blackbox exporter

Be careful with spacing. The dash before job_name should align exactly with the dashes of earlier job_name entries. For an explanation of this section, see the end of the “snmp_exporter” exercise or the multi-target exporter guide in the Prometheus documentation.
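
The net effect is that, for each entry in the targets file, prometheus fetches a URL equivalent to the manual curl test above - so the target nsrc.org with label module: ping4 is scraped as:

http://127.0.0.1:9115/probe?module=ping4&target=nsrc.org

while the instance label on the resulting metrics keeps its original value, nsrc.org.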

Now get prometheus to pick up the changes:

systemctl reload prometheus
journalctl -eu prometheus    # CHECK FOR ERRORS!

Testing

Return to the prometheus web interface at http://oob.srv1.campusX.ws.nsrc.org/prometheus

Run the following queries:

up{job="blackbox"}

probe_success

Notice how for the dummy target 1.2.3.4, the metric up has value 1 (meaning that the scrape job was able to communicate successfully with blackbox_exporter) but probe_success has value 0 (which is the result reported by blackbox_exporter).
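
To list only the targets whose probes are currently failing, you can use:

probe_success == 0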

If you want to measure packet loss over time, you will need a query which aggregates over a range of samples:

avg_over_time(probe_success[5m])

This gives the fraction of probes which succeeded over the last 5 minutes - a poor man’s measure of packet loss (although you may be better off with Smokeping or perfSONAR for this application).

Alternatively, you could use another exporter such as ping_exporter or smokeping_prober.

DNS and TLS checks

The configuration file can be extended with other types of tests.

Edit /etc/prometheus/blackbox.yml so it looks like this:

modules:
  ping4:
    prober: icmp
    timeout: 3s
    icmp:
      preferred_ip_protocol: ip4
      ip_protocol_fallback: false

  # Check DNS recursor with query for "nsrc.org"
  dns_udp4_nsrc.org:
    prober: dns
    timeout: 5s
    dns:
      transport_protocol: udp
      preferred_ip_protocol: ip4
      query_name: nsrc.org
      query_type: A

  # Check TLS certificates
  tls_certificate:
    prober: tcp
    timeout: 5s
    tcp:
      tls: true
      tls_config: {}
      preferred_ip_protocol: ip4

Tell blackbox_exporter to reload its config:

systemctl reload blackbox_exporter
journalctl -eu blackbox_exporter     # CHECK FOR ERRORS

Test:

curl 'localhost:9115/probe?target=8.8.8.8&module=dns_udp4_nsrc.org'

curl 'localhost:9115/probe?target=www.google.com:443&module=tls_certificate'

Notice how probe_duration_seconds gives you the latency. For the certificate test, the result also includes the TLS version and the certificate expiry time (very useful for alerting on certificates before they expire).

Again, if there’s any problem, you can add &debug=true to see a log of all the steps the probe is doing:

curl 'localhost:9115/probe?target=8.8.8.8&module=dns_udp4_nsrc.org&debug=true'

curl 'localhost:9115/probe?target=www.google.com:443&module=tls_certificate&debug=true'

Now edit /etc/prometheus/targets.d/blackbox.yml to add some more targets which use these modules:

- labels:
    module: ping4
  targets:
    - 1.2.3.4
    - nsrc.org
    - gw.ws.nsrc.org
    - bdr1.campusX.ws.nsrc.org
    - core1.campusX.ws.nsrc.org

- labels:
    module: dns_udp4_nsrc.org
  targets:
    - 1.1.1.1
    - 8.8.8.8

- labels:
    module: tls_certificate
  targets:
    - www.google.com:443
    - www.nsrc.org:443

There is no need to reload prometheus: changes to the targets file are picked up automatically.

You should see the results in the prometheus web interface if you do queries for:

probe_success

probe_duration_seconds

probe_duration_seconds and (probe_success == 1)

probe_tls_version_info

(probe_ssl_earliest_cert_expiry - time()) / 86400

The final query tells you how many days until the certificate expires.
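
As a sketch of how you might alert on this - assuming you load an alerting rules file via rule_files in prometheus.yml (the file name and threshold below are only examples) - a rule such as the following would fire when a certificate is within 14 days of expiry:

# e.g. /etc/prometheus/rules.d/certificates.yml (example path - adjust to your setup)
groups:
  - name: certificates
    rules:
      - alert: CertificateExpiringSoon
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: 'TLS certificate on {{ $labels.instance }} expires in less than 14 days'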

Blackbox_exporter also has an http prober for making HTTP(S) checks: it can send a specific request with custom headers, match the response body against regular expressions, and more. See the links under “Further reading” for more details.
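
As an illustration (not needed for this exercise), an extra module using the http prober could look something like this when added under modules: in blackbox.yml - the module name and the regular expression here are made up for the example:

  # Check an HTTP(S) URL and require "NSRC" somewhere in the response body
  http_2xx_nsrc:
    prober: http
    timeout: 5s
    http:
      preferred_ip_protocol: ip4
      fail_if_body_not_matches_regexp:
        - "NSRC"

You would then probe it in the same way, passing a URL as the target:

curl 'localhost:9115/probe?target=https://www.nsrc.org/&module=http_2xx_nsrc'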

Additional information

The remainder of this worksheet is background information - you don’t need to do it in the lab (although feel free to if you have time).

Prometheus as a Smokeping replacement

Blackbox_exporter only sends a single ping at a time, which either succeeds or fails. But if you have a 15 second scrape interval, then over 5 minutes you will have sent 20 pings - which means you can do Smokeping-like measurements of packet loss and jitter.

Since the metric probe_success has value 1 on success and 0 on failure, to graph the fraction of successful probes all you need is:

avg_over_time(probe_success[5m])

This takes all the available values between the evaluation time and 5 minutes before the evaluation time, and averages them. Subtract this value from 1, if you prefer to show packet loss rather than packet success.
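
In other words, to graph packet loss directly you can use:

1 - avg_over_time(probe_success[5m])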

Latencies are collected in separate metrics: probe_duration_seconds and probe_icmp_duration_seconds - the latter is broken down by phase, so you can see the setup and DNS resolution times separately from the round-trip time.

To get the minimum, median and maximum round-trip times over the last 5 minutes (20 packets at a 15 second scrape interval), you can use these queries:

min_over_time(probe_icmp_duration_seconds{instance="nsrc.org",module="ping4",phase="rtt"}[5m])
quantile_over_time(0.5, probe_icmp_duration_seconds{instance="nsrc.org",module="ping4",phase="rtt"}[5m])
max_over_time(probe_icmp_duration_seconds{instance="nsrc.org",module="ping4",phase="rtt"}[5m])

In Grafana, you could create a panel with 11 queries to cover the deciles (i.e. 10th percentile, 20th percentile etc):

min_over_time(probe_icmp_duration_seconds{instance="$instance",module="ping4",phase="rtt"}[5m])
quantile_over_time(0.1, probe_icmp_duration_seconds{instance="$instance",module="ping4",phase="rtt"}[5m])
quantile_over_time(0.2, probe_icmp_duration_seconds{instance="$instance",module="ping4",phase="rtt"}[5m])
...
quantile_over_time(0.9, probe_icmp_duration_seconds{instance="$instance",module="ping4",phase="rtt"}[5m])
max_over_time(probe_icmp_duration_seconds{instance="$instance",module="ping4",phase="rtt"}[5m])

By adding some “Fill below to…” overrides in Grafana, you can get a display which is very similar to Smokeping!

Decreasing the interval of the scrape job can give a higher resolution of packet loss detection than Smokeping can provide, and faster notification of outages. The following query can detect when a line has been down for 30 seconds continuously:

min_over_time(probe_success[30s]) == 0
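
To probe more frequently, you could set a per-job scrape_interval on the blackbox job in prometheus.yml, overriding the global value - the 5s shown here is only an illustration, and the metrics_path and relabel_configs stay exactly as before:

  - job_name: 'blackbox'
    scrape_interval: 5s    # example value - overrides the global scrape_interval for this job only
    file_sd_configs:
      - files:
         - /etc/prometheus/targets.d/blackbox.yml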

For fast polling of many targets, you may prefer to use the smokeping_prober, which sends its pings on its own schedule (decoupled from the Prometheus scrape interval) and summarises latency information in the form of histogram buckets.

Other approaches to service checking

There are other ways you can integrate service checks with prometheus.

It is possible to run nagios plugins unchanged using nrpe. First you have to install nrpe and configure it to run the plugins. Then you can talk to the nrpe daemon using nrpe_exporter.

This is fine for returning OK/warning/critical status, but it will not return extra data from the plugin (such as the textual output and Nagios performance data). Also, if you want to use nrpe’s weird flavour of SSL, you need to build nrpe_exporter yourself from source. Nonetheless, the nrpe approach can be useful when talking to appliances like pfSense which support nrpe but don’t generate prometheus metrics.

It’s also possible to use exporter_exporter, which has the ability to run external scripts. Your existing nagios scripts would need to be modified to return data in prometheus metric exposition format.
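
For example, a converted script would need to print its results in this format (the metric name here is made up purely for illustration):

# HELP my_service_check_success Whether the custom service check passed (1) or failed (0)
# TYPE my_service_check_success gauge
my_service_check_success 1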

Finally, you can write service checks which run periodically and write their results into a file for node-exporter’s textfile collector. This approach is described in the “custom metrics” exercise. It’s especially useful for service checks which are slow or expensive to run.
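
As a rough sketch of that approach - assuming node_exporter’s textfile collector reads from /var/lib/prometheus/node-exporter (check the --collector.textfile.directory flag on your system; the directory, port and metric name below are all assumptions) - a periodic cron job could run something like:

#!/bin/sh
# Hypothetical check: is anything answering on TCP port 3306 (MySQL) locally?
# Write the result via a rename so node_exporter never reads a half-written file.
TEXTFILE_DIR=/var/lib/prometheus/node-exporter   # assumption - match your node_exporter flags
if nc -z -w 5 127.0.0.1 3306; then
  result=1
else
  result=0
fi
echo "my_tcp_check_success{port=\"3306\"} $result" > "$TEXTFILE_DIR/my_tcp_check.prom.tmp"
mv "$TEXTFILE_DIR/my_tcp_check.prom.tmp" "$TEXTFILE_DIR/my_tcp_check.prom"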

Further reading