node_exporter has a simple way to return extra metrics, making it easy to plug in your own scripts for collecting any data you wish.
Create a directory:
mkdir /var/lib/node_exporter
Edit /etc/default/node_exporter and add --collector.textfile.directory=/var/lib/node_exporter to the list of options. If the options are currently empty, the result will look like this:
OPTIONS='--collector.textfile.directory=/var/lib/node_exporter'
Restart node_exporter and check there are no errors:
systemctl restart node_exporter
journalctl -eu node_exporter
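You can also confirm that the new option was picked up by looking at the running process (the exact command line shown depends on how node_exporter was installed):
ps aux | grep node_exporter
The output should include --collector.textfile.directory=/var/lib/node_exporter.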
Create a file /var/lib/node_exporter/workshop.prom containing the following:
workshop_student_is_happy{campus="campusX"} 1
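You can use any editor, or for example create it from the shell like this (the single quotes protect the double quotes inside the label):
echo 'workshop_student_is_happy{campus="campusX"} 1' > /var/lib/node_exporter/workshop.prom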
Scrape your own node:
curl localhost:9100/metrics
curl -s localhost:9100/metrics | grep workshop
You should see your metric in the response, e.g.
# HELP workshop_student_is_happy Metric read from /var/lib/node_exporter/workshop.prom
# TYPE workshop_student_is_happy untyped
workshop_student_is_happy{campus="campusX"} 1
If you don’t see this, ask for help.
Once prometheus has scraped this, you should be able to query it in the prometheus web interface at http://oob.srv1.campusX.ws.nsrc.org/prometheus
Notice that “instance” and “job” labels are added automatically to the metric by prometheus. Hence it’s fine if many machines are all generating the same metric; they will end up as different time series.
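For example, the query workshop_student_is_happy might return something like this (the exact instance and job values depend on how your prometheus scrape jobs are configured):
workshop_student_is_happy{campus="campusX", instance="srv1.campusX.ws.nsrc.org:9100", job="node"} 1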
In general, the way to generate custom metrics is to run a cronjob which periodically writes metrics into a file under /var/lib/node_exporter, to be picked up on the next scrape. There are plenty of example scripts at https://github.com/prometheus-community/node-exporter-textfile-collector-scripts
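As a sketch of what such a script could look like, here is a minimal (hypothetical) example which exposes the number of logged-in users as a gauge. It writes to a temporary file and then renames it, so node_exporter never reads a half-written file:
#!/bin/sh
# Hypothetical example collector: number of logged-in users
DIR=/var/lib/node_exporter
USERS=$(who | wc -l)
cat > "$DIR/logged_in_users.prom.new" <<EOF
# HELP logged_in_users Number of users currently logged in.
# TYPE logged_in_users gauge
logged_in_users $USERS
EOF
mv "$DIR/logged_in_users.prom.new" "$DIR/logged_in_users.prom"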
To test this, we’re going to install the script apt.sh, which reports on how many package updates are available for your system.
Download and install the script:
cd /usr/local/bin
wget https://raw.githubusercontent.com/prometheus-community/node-exporter-textfile-collector-scripts/master/apt.sh
chmod +x apt.sh
Run the script to see what it does:
./apt.sh
It should return some metrics like this:
# HELP apt_upgrades_pending Apt package pending updates by origin.
# TYPE apt_upgrades_pending gauge
apt_upgrades_pending{origin="Ubuntu:18.04/bionic-updates",arch="all"} 1
apt_upgrades_pending{origin="Ubuntu:18.04/bionic-updates",arch="amd64"} 1
apt_upgrades_pending{origin="Ubuntu:18.04/bionic-updates,Ubuntu:18.04/bionic-security",arch="all"} 1
apt_upgrades_pending{origin="Ubuntu:18.04/bionic-updates,Ubuntu:18.04/bionic-security",arch="amd64"} 8
apt_upgrades_pending{origin="grafanastable:stable",arch="amd64"} 1
# HELP node_reboot_required Node reboot is required for software updates.
# TYPE node_reboot_required gauge
node_reboot_required 0
Now we just need to create a cronjob which writes this to a file. Create /etc/cron.d/prom-apt with the following contents:
* * * * * root /usr/local/bin/apt.sh >/var/lib/node_exporter/apt.prom.new && mv /var/lib/node_exporter/apt.prom.new /var/lib/node_exporter/apt.prom
This will run every minute, and should automatically create /var/lib/node_exporter/apt.prom. Check that it has done so:
cd /var/lib/node_exporter
ls
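You can also check its contents, which should look like the output you saw earlier:
cat apt.prom
If the file doesn’t appear after a minute or two, try running the cron command by hand (as root) to see any errors:
/usr/local/bin/apt.sh >/var/lib/node_exporter/apt.prom.new && mv /var/lib/node_exporter/apt.prom.new /var/lib/node_exporter/apt.prom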
When you see this has happened, edit /etc/cron.d/prom-apt again and change it so that it only runs once a day, at 1am:
0 1 * * * root /usr/local/bin/apt.sh >/var/lib/node_exporter/apt.prom.new && mv /var/lib/node_exporter/apt.prom.new /var/lib/node_exporter/apt.prom
This is to avoid unnecessary load on the server.
In the prometheus web interface, try the following queries:
apt_upgrades_pending
node_textfile_mtime_seconds
This shows the timestamp when each metric file was last updated.
time() - node_textfile_mtime_seconds
This shows how old the metric is, in seconds. time() is a promQL built-in function.
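For example, you could use this to spot a broken cronjob: a query like the following would return any metric files which have not been updated in over a day (86400 seconds):
time() - node_textfile_mtime_seconds > 86400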
Finally, a word about labels and cardinality when designing your own metrics.
A prometheus metric has a single value which is just a (floating-point) number. But a prometheus timeseries is identified by a set of labels, which are strings.
When creating a new metric, you need to choose the labels carefully. In particular, you should be aware of the number of different values a label can have - known as its “cardinality”.
A good label will have only a limited range of different values - perhaps up to a few dozen. A bad label will have a very large number of values.
The problem with high cardinality labels is that every distinct combination of labels creates a new timeseries - so a label with millions of distinct values can result in prometheus building millions of distinct timeseries. Each timeseries requires RAM to ingest, and queries which touch many timeseries will have to read many blocks on disk, making them very inefficient.
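To make this concrete, suppose (as a hypothetical example) a metric app_requests_total is exported with labels level and instance. Every distinct combination of label values becomes a separate timeseries:
app_requests_total{level="info",instance="srv1.campusX.ws.nsrc.org"} 120
app_requests_total{level="warning",instance="srv1.campusX.ws.nsrc.org"} 4
app_requests_total{level="info",instance="srv2.campusX.ws.nsrc.org"} 87
With 3 levels and 50 hosts that is at most 3 × 50 = 150 timeseries, which is fine; add a label that can take millions of values and the number of timeseries explodes.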
Here is an example of a good label:
{level="debug"}
{level="info"}
{level="warning"}
This only has 3 possible values. Another good label:
{instance="my-host.my-domain.com"}
You have a limited number of hosts in your domain, and in any case you want to be able to drill down to timeseries for a given host.
An example of a potentially bad label:
{src_ip="192.0.2.1"}
If src_ip could be the address of anyone on the Internet, it could have nearly 4 billion different values. On the other hand, if you know it could only be addresses in your own campus’ netblock, it might be acceptable.
An example of a definitely bad label:
{timestamp="Mar 2 15:50:28"}
This forces a separate timeseries for every second: hence at least 86,400 distinct timeseries per day, multiplied by the number of other different label combinations you have. Don’t do this!
In general: don’t put text inside labels which changes frequently, or (worse) comes from user input that you don’t control.