node_exporter has a simple way to return extra metrics, making it easy to plug in your own scripts for collecting any data you wish.
Create a directory:
mkdir /var/lib/node_exporter
Edit /etc/default/node_exporter and add --collector.textfile.directory=/var/lib/node_exporter to the list of options. If the options are currently empty, the result will look like this:
OPTIONS='--collector.textfile.directory=/var/lib/node_exporter'
Restart node_exporter and check there are no errors:
systemctl restart node_exporter
journalctl -eu node_exporter
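You can also confirm that the new option was picked up by looking at the running process (the exact command line shown depends on how node_exporter was installed):
ps aux | grep node_exporter
The output should include --collector.textfile.directory=/var/lib/node_exporter.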
Create a file /var/lib/node_exporter/workshop.prom containing the following:
workshop_student_is_happy{campus="campusX"} 1
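You can use any editor, or for example create it from the shell like this (the single quotes protect the double quotes inside the label):
echo 'workshop_student_is_happy{campus="campusX"} 1' > /var/lib/node_exporter/workshop.prom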
Scrape your own node:
curl localhost:9100/metrics
curl -s localhost:9100/metrics | grep workshop
You should see your metric in the response, e.g.
# HELP workshop_student_is_happy Metric read from /var/lib/node_exporter/workshop.prom
# TYPE workshop_student_is_happy untyped
workshop_student_is_happy{campus="campusX"} 1
If you don’t see this, ask for help.
Once prometheus has scraped this, you should be able to query it in the prometheus web interface at http://oob.srv1.campusX.ws.nsrc.org/prometheus
Notice that “instance” and “job” labels are added automatically to the metric by prometheus. Hence it’s fine if many machines are all generating the same metric; they will end up as different time series.
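For example, the query workshop_student_is_happy might return something like this (the exact instance and job values depend on how your prometheus scrape jobs are configured):
workshop_student_is_happy{campus="campusX", instance="srv1.campusX.ws.nsrc.org:9100", job="node"} 1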
In general, the way to generate custom metrics is to run a cronjob which periodically writes metrics into a file under /var/lib/node_exporter, to be picked up on the next scrape. There are plenty of example scripts at https://github.com/prometheus-community/node-exporter-textfile-collector-scripts
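As a sketch of what such a script could look like, here is a minimal (hypothetical) example which exposes the number of logged-in users as a gauge. It writes to a temporary file and then renames it, so node_exporter never reads a half-written file:
#!/bin/sh
# Hypothetical example collector: number of logged-in users
DIR=/var/lib/node_exporter
USERS=$(who | wc -l)
cat > "$DIR/logged_in_users.prom.new" <<EOF
# HELP logged_in_users Number of users currently logged in.
# TYPE logged_in_users gauge
logged_in_users $USERS
EOF
mv "$DIR/logged_in_users.prom.new" "$DIR/logged_in_users.prom"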
To test this, we’re going to install the script apt.sh, which reports on how many package updates are available for your system.
Download and install the script:
cd /usr/local/bin
wget https://raw.githubusercontent.com/prometheus-community/node-exporter-textfile-collector-scripts/master/apt.sh
chmod +x apt.sh
Run the script to see what it does:
./apt.sh
It should return some metrics like this:
# HELP apt_upgrades_pending Apt package pending updates by origin.
# TYPE apt_upgrades_pending gauge
apt_upgrades_pending{origin="Ubuntu:18.04/bionic-updates",arch="all"} 1
apt_upgrades_pending{origin="Ubuntu:18.04/bionic-updates",arch="amd64"} 1
apt_upgrades_pending{origin="Ubuntu:18.04/bionic-updates,Ubuntu:18.04/bionic-security",arch="all"} 1
apt_upgrades_pending{origin="Ubuntu:18.04/bionic-updates,Ubuntu:18.04/bionic-security",arch="amd64"} 8
apt_upgrades_pending{origin="grafanastable:stable",arch="amd64"} 1
# HELP node_reboot_required Node reboot is required for software updates.
# TYPE node_reboot_required gauge
node_reboot_required 0
Now we just need to create a cronjob which writes this to a file. Create /etc/cron.d/prom-apt with the following contents:
* * * * * root /usr/local/bin/apt.sh >/var/lib/node_exporter/apt.prom.new && mv /var/lib/node_exporter/apt.prom.new /var/lib/node_exporter/apt.prom
This will run every minute, and should automatically create /var/lib/node_exporter/apt.prom. Check that it has done so:
cd /var/lib/node_exporter
ls
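You can also check its contents, which should look like the output you saw earlier:
cat apt.prom
If the file doesn’t appear after a minute or two, try running the cron command by hand (as root) to see any errors:
/usr/local/bin/apt.sh >/var/lib/node_exporter/apt.prom.new && mv /var/lib/node_exporter/apt.prom.new /var/lib/node_exporter/apt.prom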
When you see this has happened, edit /etc/cron.d/prom-apt again and change it so that it only runs once a day, at 1am:
0 1 * * * root /usr/local/bin/apt.sh >/var/lib/node_exporter/apt.prom.new && mv /var/lib/node_exporter/apt.prom.new /var/lib/node_exporter/apt.prom
This is to avoid unnecessary load on the server.
In the prometheus web interface, try the following queries:
apt_upgrades_pending
node_textfile_mtime_seconds
This shows the timestamp when each metric file was last updated.
time() - node_textfile_mtime_seconds
This shows how old the metric is, in seconds. time() is a promQL built-in function.
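For example, you could use this to spot a broken cronjob: a query like the following would return any metric files which have not been updated in over a day (86400 seconds):
time() - node_textfile_mtime_seconds > 86400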
Finally, a word about labels and cardinality when designing your own metrics.
A prometheus metric has a single value which is just a (floating-point) number. But a prometheus timeseries is identified by a set of labels, which are strings.
When creating a new metric, you need to choose the labels carefully. In particular, you should be aware of the number of different values a label can have - known as its “cardinality”.
A good label will have only a limited range of different values - perhaps up to a few dozen. A bad label will have a very large number of values.
The problem with high cardinality labels is that every distinct combination of labels creates a new timeseries - so a label with millions of distinct values can result in prometheus building millions of distinct timeseries. Each timeseries requires RAM to ingest, and queries which touch many timeseries will have to read many blocks on disk, making them very inefficient.
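To make this concrete, suppose (as a hypothetical example) a metric app_requests_total is exported with labels level and instance. Every distinct combination of label values becomes a separate timeseries:
app_requests_total{level="info",instance="srv1.campusX.ws.nsrc.org"} 120
app_requests_total{level="warning",instance="srv1.campusX.ws.nsrc.org"} 4
app_requests_total{level="info",instance="srv2.campusX.ws.nsrc.org"} 87
With 3 levels and 50 hosts that is at most 3 × 50 = 150 timeseries, which is fine; add a label that can take millions of values and the number of timeseries explodes.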
Here is an example of a good label:
{level="debug"}
{level="info"}
{level="warning"}
This only has 3 possible values. Another good label:
{instance="my-host.my-domain.com"}
You have a limited number of hosts in your domain, and in any case you want to be able to drill down to timeseries for a given host.
An example of a potentially bad label:
{src_ip="192.0.2.1"}
If src_ip could be the address of anyone on the Internet, it could have nearly 4 billion different values. On the other hand, if you know it could only be addresses in your own campus’ netblock, it might be acceptable.
An example of a definitely bad label:
{timestamp="Mar 2 15:50:28"}
This forces a separate timeseries for every second: hence at least 86,400 distinct timeseries per day, multiplied by the number of other different label combinations you have. Don’t do this!
In general: don’t put text inside labels which changes frequently, or (worse) comes from user input that you don’t control.