To collect metrics from Linux services, such as memory/CPU/disk utilisation, you install the node_exporter on each server. This is much easier to configure and manage than SNMP.
Do this on your campus server instance (srv1.campusX.ws.nsrc.org)
(If node_exporter is pre-installed and you are not upgrading, skip to the next section “Start node exporter”)
First visit the node_exporter releases page (https://github.com/prometheus/node_exporter/releases), then replace the “X”, “Y” and “Z” below in the link with the current node_exporter version number listed there (scroll down a bit to find the download link).
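If you prefer the command line, you can also look up the latest release tag via the public GitHub API. This is optional and just a convenience; it simply prints the tag name of the most recent release:

# curl -sS https://api.github.com/repos/prometheus/node_exporter/releases/latest | grep '"tag_name"'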
We are assuming you are doing these exercises as the root
user or using the sudo
command as needed.
Fetch and unpack the latest release from the releases page and create a symlink so that /opt/node_exporter
refers to the current version.
# cd /root
# wget https://github.com/prometheus/node_exporter/releases/download/vX.Y.Z/node_exporter-X.Y.Z.linux-amd64.tar.gz
# tar -C /opt -xvzf node_exporter-X.Y.Z.linux-amd64.tar.gz
If node exporter was already installed, then remove the symbolic link first:
# cd /opt
# rm node_exporter
And, now create the symbolic link to the current node_exporter version:
# ln -s node_exporter-X.Y.Z.linux-amd64 /opt/node_exporter
The symbolic link allows us to create a systemd unit file without needing to update it each time we upgrade node_exporter.
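As a quick optional check, you can confirm where the symlink points:

# readlink -f /opt/node_exporter     # should show the version you just unpacked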
If you are upgrading, stop here and skip ahead to the “Start node_exporter” section.
Use a text editor to create a systemd unit file /etc/systemd/system/node_exporter.service
with the following contents:
[Unit]
Description=Prometheus Node Exporter
Documentation=https://github.com/prometheus/node_exporter
After=network-online.target
[Service]
User=root
EnvironmentFile=/etc/default/node_exporter
ExecStart=/opt/node_exporter/node_exporter $OPTIONS
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
Tell systemd to read this new file:
# systemctl daemon-reload
Also create an options file /etc/default/node_exporter
with the following contents:
OPTIONS=''
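For this exercise the options string stays empty. If you later need to pass flags to node_exporter, this file is where they go; for example (illustrative only, do not set this now), you could move the exporter to a different port with:

OPTIONS='--web.listen-address=:9101'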
Let’s start node_exporter:
# systemctl enable node_exporter # start on future boots
# systemctl restart node_exporter # restart now (works for upgrade and new)
# journalctl -eu node_exporter # check for msg="Listening on" address=:9100
Use cursor keys to move around the journalctl log output, and “q” to quit. If there are any errors, then go back and fix them.
This command shows you the status and just the last few lines of log output:
# systemctl status node_exporter # check for "Active: active (running)" text
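If you want to confirm the listening port directly, you can also check it with ss (optional):

# ss -tlnp | grep ':9100'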
The exporter is now running and listening on port 9100. Test it by doing a manual scrape:
# curl localhost:9100/metrics
You can filter the output to look at just a subset of metrics, like this:
# curl -sS localhost:9100/metrics | grep filesystem
(-s
= “silent”, stops curl showing the progress information. -S
= “show error”, overrides the silence if there’s a problem)
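Another optional sanity check is to count how many node_ metrics the exporter currently exposes; the exact number will vary depending on the kernel and hardware:

# curl -sS localhost:9100/metrics | grep -c '^node_'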
Now we will configure prometheus to scrape this host. As we want to be able to scrape many hosts, we’ll put the list of targets to scrape in a separate file to make it easier to manage.
Create a new directory:
# mkdir /etc/prometheus/targets.d
Create a new file /etc/prometheus/targets.d/node.yml
with the following contents:
- targets:
    - 'srv1.campusX.ws.nsrc.org:9100'
Edit /etc/prometheus/prometheus.yml. Under the scrape_configs section, paste the following lines to create a new scrape job called “node”.
  - job_name: 'node'
    file_sd_configs:
      - files:
          - /etc/prometheus/targets.d/node.yml
Now the whole scrape_configs
part of the config should look like this:
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    metrics_path: '/prometheus/metrics'

    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    file_sd_configs:
      - files:
          - /etc/prometheus/targets.d/node.yml
Be careful with spacing. In particular, the hyphens before each job_name:
must be in exactly the same column, so they line up vertically. You need two spaces before - job_name:
Now tell prometheus that you have changed its configuration:
# systemctl reload prometheus
# journalctl -eu prometheus
CHECK THE LOGS YOU SEE. Use cursor keys to scroll if necessary. You should see:
msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml
If you don’t, then find your error and correct it. In the meantime, prometheus will continue running with your old configuration. You may get a more helpful description of the error by running:
# /opt/prometheus/promtool check config /etc/prometheus/prometheus.yml
REMEMBER: don’t restart prometheus after changing its configuration; reload it as shown above. A restart forces prometheus to re-read its entire write-ahead log (WAL), which on a large running installation can take many minutes. And if there is an error in your configuration, prometheus won’t start at all until you fix it.
Return to the prometheus web interface at http://oob.srv1.campusX.ws.nsrc.org/prometheus, or from the virtual training platform front page, select Web > srv1 under your campus, then click on Prometheus.
In the query box (click on “Graph” option at the top of the page) start typing “node” and you’ll get a list of matching metrics.
Select “node_filesystem_avail_bytes” as an example of a gauge. Look at the current values, and look at the graph. The lines may look very flat, but as you move the mouse along them, you may see the values changing slightly.
Now change your query to select a single timeseries, by changing the query to:
node_filesystem_avail_bytes{mountpoint="/"}
This may give a clearer picture of the free disk space reducing over time.
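As an aside, the same query can be run from the command line against the prometheus HTTP API. This is optional, and assumes prometheus is serving under the /prometheus path prefix on port 9090, as its own scrape job above suggests:

# curl -sS -G 'http://localhost:9090/prometheus/api/v1/query' --data-urlencode 'query=node_filesystem_avail_bytes{mountpoint="/"}'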
Finally: there is a metric called “up” which shows the success or failure of scrape jobs. Enter “up” in the query, select Execute, and select the “Table” tab. You should see something like this:
Element | Value
---|---
up{instance="localhost:9090",job="prometheus"} | 1
up{instance="srv1.campusX.ws.nsrc.org:9100",job="node"} | 1
This means you are scraping two targets, and they both responded successfully to the scrape.
More information about the health of scraping, including any scrape errors, can be seen by using the Status > Targets
menu option.
Run the following query in the prometheus web UI:
node_network_transmit_bytes_total
Look at it in the Table and Graph views. These are counter values which keep increasing. But for network traffic, what you’re actually interested in is the rate in bytes per second or bits per second, not the raw counter values.
Prometheus has built-in functions which can convert counters into rates. Modify your query as follows, and look at the Graph again:
rate(node_network_transmit_bytes_total[2m])
What do you see now?
How this query works:

- node_network_transmit_bytes_total[2m] covers a range of data points collected over the last 2 minutes
- rate(...) takes the first and last data point in that range, and calculates the average rate over that period.

Given that we are sampling at 15 second intervals, the first and last data points will be 1 minute and 45 seconds apart, so this gives a “smoothed” rate averaged over that period.
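For example (illustrative numbers only): if the first sample in the window is 1,000,000 bytes and the last sample, 105 seconds later, is 1,840,000 bytes, then rate() reports (1840000 - 1000000) / 105 = 8000 bytes per second.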
Now change “rate” to “irate”, so that the query looks like this:
irate(node_network_transmit_bytes_total[2m])
Again, this covers a range of data points over 2 minutes. But irate takes the last two data points in that period, giving an “instantaneous” rate - this will be more spiky as it reacts very quickly to changes.
Which you use depends on the context. irate
gives you a more accurate representation of the rate at the highest resolution possible, but it may miss spikes entirely when you zoom out to a long time window. Also, for alerting purposes the smoothed value may subject you to fewer false alarms.
Finally, note that these graphs are in Bytes per Second. Normally for network traffic you want Bits per Second. This is easy to fix: just scale up the output by a factor of 8.
irate(node_network_transmit_bytes_total[2m]) * 8
Now update your configuration to monitor the other servers in the class. All you need to do is edit /etc/prometheus/targets.d/node.yml
so that it looks like this:
- targets:
    - 'noc.ws.nsrc.org:9100'
    - 'srv1.campus1.ws.nsrc.org:9100'
    - 'srv1.campus2.ws.nsrc.org:9100'
    - 'srv1.campus3.ws.nsrc.org:9100'
    - 'srv1.campus4.ws.nsrc.org:9100'
    - 'srv1.campus5.ws.nsrc.org:9100'
    - 'srv1.campus6.ws.nsrc.org:9100'
Save the changes to this file. Note that for this change you do not need to reload prometheus! Prometheus is constantly monitoring the targets file and will pick up changes by itself.
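If you want to confirm from the command line which targets prometheus has picked up, the HTTP API lists them too (optional; again assuming the /prometheus path prefix):

# curl -sS http://localhost:9090/prometheus/api/v1/targets | grep -o '"scrapeUrl":"[^"]*"'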
Return to the web interface. Of course, you may have gotten here before the other campuses have got their node exporters working. How can you tell which ones you are successfully scraping?
Easy: execute a query on “up” and see which targets are up (1) or down (0). This will be easiest to see in the “Table” view.
Try shutting down your own node_exporter:
# systemctl stop node_exporter     # then wait about 30 seconds
# systemctl start node_exporter
Now run the query “up” again. Using the Graph view, you should be able to see the history of when your node went down, and came back up again.
At this point, you may find queries are returning more results than are useful. For example, try “node_filesystem_avail_bytes” and you’ll see all the filesystems across all the hosts. As your network gets bigger, such queries become very slow.
So you should start getting into the habit of filtering your queries to limit the number of timeseries returned. Try the following:
node_filesystem_avail_bytes{instance="srv1.campus1.ws.nsrc.org:9100"}
You can also use regular expression pattern matches against labels. Try this:
node_filesystem_avail_bytes{instance=~".*campus1.*"}
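You can also combine label matchers in one query, for example (assuming the root filesystem is mounted at “/”):

node_filesystem_avail_bytes{instance=~".*campus1.*", mountpoint="/"}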
This gets tedious, but once you have developed a useful query, you would normally create it as a permanent dashboard in a tool like grafana - that will be a later exercise.
If you have spare time: deploy node_exporter to the other hosts in your own campus (host1-host6.campusX.ws.nsrc.org) and scrape them from your prometheus server.
In real life, to install node_exporter across many servers, you could put all the components into a tarball, or use a configuration management tool such as ansible.
On Windows targets, you would install the windows_exporter rather than node_exporter. It runs on a different port (9182), so you would create a separate scrape job for all your Windows machines, with its own targets file.
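A sketch of what such a separate job might look like, following the same pattern as the node job above (the job name and targets file name are illustrative, not part of this lab):

  - job_name: 'windows'
    file_sd_configs:
      - files:
          - /etc/prometheus/targets.d/windows.yml

where /etc/prometheus/targets.d/windows.yml would list your Windows hosts with port 9182 instead of 9100.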