Prometheus can perform Nagios-like “blackbox” service testing - such as pings, http queries and DNS checks - using the blackbox_exporter.
The setup is somewhat like snmp_exporter. You run the blackbox_exporter at the place where the checks should originate - this can be on your prometheus server (although it doesn’t have to be). You then scrape blackbox_exporter from prometheus, and each scrape triggers a check.
Do this exercise on your campus srv1 server.
(If blackbox_exporter is pre-installed, skip to the next section “Start blackbox_exporter”)
Fetch and unpack the latest release from the releases page and create a symlink so that /opt/blackbox_exporter
refers to the current version.
wget https://github.com/prometheus/blackbox_exporter/releases/download/vX.Y.Z/blackbox_exporter-X.Y.Z.linux-amd64.tar.gz
tar -C /opt -xvzf blackbox_exporter-X.Y.Z.linux-amd64.tar.gz
ln -s blackbox_exporter-X.Y.Z.linux-amd64 /opt/blackbox_exporter
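To check that the unpacked binary runs on this machine, you can ask it for its version (output will vary with the release you downloaded):

/opt/blackbox_exporter/blackbox_exporter --version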
Use a text editor to create a systemd unit file /etc/systemd/system/blackbox_exporter.service
with the following contents:
[Unit]
Description=Prometheus Blackbox Exporter
Documentation=https://github.com/prometheus/blackbox_exporter
After=network-online.target
[Service]
User=prometheus
Restart=on-failure
RestartSec=5
EnvironmentFile=/etc/default/blackbox_exporter
AmbientCapabilities=CAP_NET_RAW
ExecStart=/opt/blackbox_exporter/blackbox_exporter $OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
[Install]
WantedBy=multi-user.target
Tell systemd to read this new file:
systemctl daemon-reload
Also create an options file /etc/default/blackbox_exporter
with the following contents:
OPTIONS='--web.listen-address=127.0.0.1:9115 --config.file=/etc/prometheus/blackbox.yml'
There is a sample configuration in /opt/blackbox_exporter/blackbox.yml
but we will create one from scratch.
Create /etc/prometheus/blackbox.yml
with the following contents:
modules:
  ping4:
    prober: icmp
    timeout: 3s
    icmp:
      preferred_ip_protocol: ip4
      ip_protocol_fallback: false
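(As an aside: on a network with IPv6 connectivity, a corresponding ping6 module could sit alongside it. This is just a sketch, not needed for this lab:

  ping6:
    prober: icmp
    timeout: 3s
    icmp:
      preferred_ip_protocol: ip6
      ip_protocol_fallback: false
)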
Now start blackbox_exporter:
systemctl enable blackbox_exporter # start on future boots
systemctl start blackbox_exporter # start now
journalctl -eu blackbox_exporter # check for msg="Listening on address" address=127.0.0.1:9115
Use cursor keys to move around the journalctl log output, and “q” to quit. If there are any errors, then go back and fix them.
Test using curl. You need to provide two arguments: the name of a module you have defined in blackbox.yml, and the target (hostname or address to test). You will need to quote the URL because it contains the special &
character.
curl 'localhost:9115/probe?target=nsrc.org&module=ping4'
The response should include probe_success 1
if the remote host responded, plus metrics for the time taken for the DNS query and the time taken for the overall probe.
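The output will look something like this (abridged, with HELP/TYPE lines omitted; the values shown are illustrative):

probe_dns_lookup_time_seconds 0.0042
probe_duration_seconds 0.2157
probe_icmp_duration_seconds{phase="resolve"} 0.0042
probe_icmp_duration_seconds{phase="rtt"} 0.1996
probe_icmp_duration_seconds{phase="setup"} 0.0005
probe_success 1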
Unlike nagios, this sends only a single ping. Therefore, if that one packet gets lost, probe_success will be 0. We’ll see how to deal with this shortly.
NOTE: if you have a problem, you can get detailed logs from the probe in the HTTP response by adding &debug=true
to the URL:
curl 'localhost:9115/probe?target=nsrc.org&module=ping4&debug=true'
This will usually make it clear at what step the probe encountered a problem.
We now need to configure prometheus to scrape the blackbox_exporter.
Firstly, configure a targets file /etc/prometheus/targets.d/blackbox.yml
containing the following (remembering to replace campusX
as required):
- labels:
    module: ping4
  targets:
    - 1.2.3.4
    - nsrc.org
    - gw.ws.nsrc.org
    - bdr1.campusX.ws.nsrc.org
    - core1.campusX.ws.nsrc.org
(1.2.3.4 is an address which we know is not going to respond)
Edit /etc/prometheus/prometheus.yml
and add the following to the bottom of the scrape_configs:
section:
  - job_name: 'blackbox'
    file_sd_configs:
      - files:
          - /etc/prometheus/targets.d/blackbox.yml
    metrics_path: /probe
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [module]
        target_label: __param_module
      - target_label: __address__
        replacement: 127.0.0.1:9115  # blackbox exporter
Be careful with spacing. The dash before job_name
should align exactly with the dashes of earlier job_name
entries. For an explanation of this section, see the end of the “snmp_exporter” exercise or the multi-target exporter guide in the Prometheus documentation.
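In brief, here is how one target flows through the relabelling rules above (a sketch, tracing nsrc.org from the targets file):

__address__    = nsrc.org          # initial value from the targets file
instance       = nsrc.org          # copied from __address__ (rule 1)
__param_target = nsrc.org          # becomes ?target=nsrc.org (rule 2)
__param_module = ping4             # from the module label; becomes &module=ping4 (rule 3)
__address__    = 127.0.0.1:9115    # the scrape itself goes to blackbox_exporter (rule 4)

so prometheus ends up requesting http://127.0.0.1:9115/probe?target=nsrc.org&module=ping4 - exactly the URL we tested with curl earlier.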
Now get prometheus to pick up the changes:
systemctl reload prometheus
journalctl -eu prometheus # CHECK FOR ERRORS!
Return to the prometheus web interface at http://oob.srv1.campusX.ws.nsrc.org/prometheus
Run the following queries:
up{job="blackbox"}
probe_success
Notice how for the dummy target 1.2.3.4, we have metric up 1
(meaning that the scrape job was able to communicate successfully with blackbox_exporter) but probe_success 0
(which is the result from blackbox_exporter).
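As an aside, a query along these lines should list targets where blackbox_exporter itself answered but the probe failed:

probe_success == 0 and up{job="blackbox"} == 1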
If you want to get the proportion of packets lost over time, you will need to average the probe results over a time window:
avg_over_time(probe_success[5m])
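To show this as a percentage of packets lost instead, invert and scale the same query:

100 * (1 - avg_over_time(probe_success[5m]))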
This is how you can get a poor man’s test for the proportion of packets lost (although you may be better off with smokeping or perfsonar for this application).
Alternatively you could use another exporter such as ping_exporter or smokeping_prober
The configuration file can be extended with other types of tests.
Edit /etc/prometheus/blackbox.yml
so it looks like this:
modules:
  ping4:
    prober: icmp
    timeout: 3s
    icmp:
      preferred_ip_protocol: ip4
      ip_protocol_fallback: false
  # Check DNS recursor with query for "nsrc.org"
  dns_udp4_nsrc.org:
    prober: dns
    timeout: 5s
    dns:
      transport_protocol: udp
      preferred_ip_protocol: ip4
      query_name: nsrc.org
      query_type: A
  # Check TLS certificates
  tls_certificate:
    prober: tcp
    timeout: 5s
    tcp:
      tls: true
      tls_config: {}
      preferred_ip_protocol: ip4
Tell blackbox_exporter to reload its config:
systemctl reload blackbox_exporter
journalctl -eu blackbox_exporter # CHECK FOR ERRORS
Test:
curl 'localhost:9115/probe?target=8.8.8.8&module=dns_udp4_nsrc.org'
curl 'localhost:9115/probe?target=www.google.com:443&module=tls_certificate'
Notice how probe_duration_seconds
gives you the latency. For the certificate test, the result also includes the TLS version and the certificate expiry time (very useful for alerting on certificates before they expire).
Again, if there’s any problem, you can add &debug=true
to see a log of all the steps the probe is doing:
curl 'localhost:9115/probe?target=8.8.8.8&module=dns_udp4_nsrc.org&debug=true'
curl 'localhost:9115/probe?target=www.google.com:443&module=tls_certificate&debug=true'
Now edit /etc/prometheus/targets.d/blackbox.yml
to add some more targets which use these modules:
- labels:
    module: ping4
  targets:
    - 1.2.3.4
    - nsrc.org
    - gw.ws.nsrc.org
    - bdr1.campusX.ws.nsrc.org
    - core1.campusX.ws.nsrc.org
- labels:
    module: dns_udp4_nsrc.org
  targets:
    - 1.1.1.1
    - 8.8.8.8
- labels:
    module: tls_certificate
  targets:
    - www.google.com:443
    - www.nsrc.org:443
There is no need to reload prometheus: changes to the targets file are picked up automatically.
You should see the results in the prometheus web interface if you do queries for:
probe_success
probe_duration_seconds
probe_duration_seconds and (probe_success == 1)
probe_tls_version_info
(probe_ssl_earliest_cert_expiry - time()) / 86400
The final query tells you how many days until the certificate expires.
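If you have alerting rules set up, a rule along these lines could warn you two weeks in advance (a sketch; the group name, alert name and 14-day threshold are our own choices):

groups:
  - name: certificates
    rules:
      - alert: CertificateExpiringSoon
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 14
        for: 1h
        annotations:
          summary: 'Certificate for {{ $labels.instance }} expires in under 14 days'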
Blackbox_exporter also has an http prober for making http(s) queries: it can send a specific request and headers, match the response body against regular expressions, and more. See the links under “References” for more details.
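For example, a module which checks that a page returns a 2xx response whose body contains a given string might look like this (a sketch; the module name and regexp are illustrative, and it is not needed for this lab):

  http_2xx_welcome:
    prober: http
    timeout: 5s
    http:
      preferred_ip_protocol: ip4
      fail_if_body_not_matches_regexp:
        - 'Welcome'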
The remainder of this worksheet is background information - you don’t need to do it in the lab (although feel free if you have time).
Blackbox_exporter only sends a single ping at a time, which either succeeds or fails. But if you have a 15 second scrape interval, then over 5 minutes you will have sent 20 pings - which means you can do Smokeping-like measurements of packet loss and jitter.
Given that the metric probe_success
has value 1 on success and 0 on failure, then to graph the fraction of successful packets all you need is:
avg_over_time(probe_success[5m])
This takes all the available values between the evaluation time and 5 minutes before the evaluation time, and averages them. Subtract this value from 1, if you prefer to show packet loss rather than packet success.
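If you graph this frequently, you could precompute it with a recording rule (a sketch; the rule name follows the common level:metric:operation convention and is our own choice):

groups:
  - name: blackbox_loss
    rules:
      - record: instance_module:packet_loss:avg5m
        expr: 1 - avg_over_time(probe_success{module="ping4"}[5m])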
Latencies are collected in separate metrics: probe_duration_seconds and probe_icmp_duration_seconds - the latter is broken down by phase, so you can see the setup and DNS resolution times separately from the round-trip time.
To get the minimum, median and maximum round-trip times over 20 packets, you can use these queries:
min_over_time(probe_icmp_duration_seconds{instance="nsrc.org",module="ping4",phase="rtt"}[5m])
quantile_over_time(0.5, probe_icmp_duration_seconds{instance="nsrc.org",module="ping4",phase="rtt"}[5m])
max_over_time(probe_icmp_duration_seconds{instance="nsrc.org",module="ping4",phase="rtt"}[5m])
In Grafana, you could create a panel with 11 queries to cover the deciles (i.e. 10th percentile, 20th percentile etc):
min_over_time(probe_icmp_duration_seconds{instance="$instance",module="ping4",phase="rtt"}[5m])
quantile_over_time(0.1, probe_icmp_duration_seconds{instance="$instance",module="ping4",phase="rtt"}[5m])
quantile_over_time(0.2, probe_icmp_duration_seconds{instance="$instance",module="ping4",phase="rtt"}[5m])
...
quantile_over_time(0.9, probe_icmp_duration_seconds{instance="$instance",module="ping4",phase="rtt"}[5m])
max_over_time(probe_icmp_duration_seconds{instance="$instance",module="ping4",phase="rtt"}[5m])
By adding some “Fill below to…” overrides in Grafana, you can get a display which is very similar to Smokeping!
Decreasing the interval of the scrape job can give a higher resolution of packet loss detection than Smokeping can provide, and faster notification of outages. The following query can detect when a line has been down for 30 seconds continuously:
min_over_time(probe_success[30s]) == 0
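Wrapped in an alerting rule, that might look like this (a sketch; the group and alert names are our own choices):

groups:
  - name: blackbox_outage
    rules:
      - alert: HostDown
        expr: min_over_time(probe_success[30s]) == 0
        annotations:
          summary: '{{ $labels.instance }} has answered no probes for the last 30 seconds'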
For fast polling of many targets, you may prefer to use the smokeping_prober which offloads the polling work from Prometheus itself, and summarises latency information in the form of histogram buckets.
There are other ways you can integrate service checks with prometheus.
It is possible to run nagios plugins unchanged using nrpe. First you have to install nrpe and configure it to run the plugins. Then you can talk to the nrpe daemon using nrpe_exporter.
This is fine for returning ok/warning/critical status, but it will not return any extra data from the plugin (such as textual result data and nagios performance metrics). Also, if you want to use nrpe’s weird flavour of SSL, then you need to build nrpe_exporter yourself from source. Nonetheless, the nrpe approach can be useful when talking to appliances like pfSense which support nrpe but don’t generate prometheus metrics.
It’s also possible to use exporter_exporter, which has the ability to run external scripts. Your existing nagios scripts would need to be modified to return data in prometheus metric exposition format.
Finally, you can write service checks which run periodically and write their results into a file for node-exporter’s textfile collector. This approach is described in the “custom metrics” exercise. It’s especially useful for service checks which are slow or expensive to run.
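As a hypothetical sketch, a script run from cron could publish a metric like this (the directory, metric name and service checked are all illustrative; it assumes node_exporter was started with --collector.textfile.directory pointing at /var/lib/prometheus/node-exporter):

#!/bin/sh
# Hypothetical service check: metric is 1 if sshd is running, 0 otherwise
TEXTFILE_DIR=/var/lib/prometheus/node-exporter
if pgrep -x sshd >/dev/null; then UP=1; else UP=0; fi
# Write to a temp file and rename, so node_exporter never reads a half-written file
echo "my_service_up{service=\"sshd\"} $UP" > "$TEXTFILE_DIR/my_service.prom.$$"
mv "$TEXTFILE_DIR/my_service.prom.$$" "$TEXTFILE_DIR/my_service.prom"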