In this lab you’ll look at using Prometheus and Grafana to monitor your local Proxmox infrastructure, although it can also be used for cloud services.
Everyone in the cluster will be working together. Whenever a change is required to the Prometheus or Grafana configuration, the groups in your cluster will need to coordinate on who makes that change. However, everyone can use the web interfaces to browse data.
Don’t worry if you don’t get through this all! Treat it as reference material in case you want to deploy this another time.
There is a separate monitoring VM which lives outside your Proxmox
cluster; its hostname is monX.ws.nsrc.org
Everyone in the cluster can access the Prometheus web interface from their laptop. Use the drop-down menu from the lab web interface:

This will take you to the URL http://monX.ws.nsrc.org:9090/ (replacing “X” with your cluster number). To keep the lab simple, there is no HTTPS and no authentication.
Inside the PromQL query box, enter “up” and then click on Execute, like this:

You should see a table with a single metric:
up{instance="localhost:9090", job="prometheus"} 1
There is a single scrape job, which is prometheus scraping itself for metrics about its own state. The value is 1 which means the scrape was successful. You can also see scrape information if you select “Status > Targets” from the top menu.
Let’s look at one of the measurements it’s collecting. Go back to the main page by clicking “Prometheus” at the top left corner, then enter the query “scrape_duration_seconds” and click Execute again:

You should see a number, which is the duration of the most recent scrape, in seconds.
Now click on the “Graph” tab:

You should see a graph showing the history of this metric over the last hour, like this:

If you don’t see this, please ask an instructor for help.
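If you want to experiment a little further, PromQL functions can be applied to this metric; for example, the following query (standard PromQL, nothing specific to this lab) shows the average scrape duration over the last 15 minutes:
avg_over_time(scrape_duration_seconds[15m])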
Prometheus node_exporter has already been installed on all the nodes in your cluster. You’re now going to test this, and get prometheus to collect data.
Everyone can do this part.
Open an SSH command-line connection to your monitoring server, using the lab web interface.

This should open a terminal window and give you a prompt like this:
sysadm@mon1:~$
Now use curl to send a test scrape to one of the nodes
in your cluster:
curl nodeXY.ws.nsrc.org:9100/metrics
You can copy-paste, but substitute XY with any node in your cluster.
You should see screenfuls of metrics come back, in Prometheus/openmetrics format, like these:
...
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 364
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
Again, if you don’t, please ask for help.
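Since the full output runs to many screenfuls, it can be handy to filter it; for example, this shows just the load-average metrics (node_load1, node_load5 and node_load15 are standard node_exporter metric names):
curl -s nodeXY.ws.nsrc.org:9100/metrics | grep '^node_load'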
The prometheus configuration file is at
/etc/prometheus/prometheus.yml and you can inspect it:
cat /etc/prometheus/prometheus.yml
Now, you should pick one person in your cluster to amend the configuration.
That person will need to change to root (sudo -s) and
then use an editor to edit /etc/prometheus/prometheus.yml.
Find the commented-out section which starts with
job_name: node
#  - job_name: node
#    file_sd_configs:
#      - files:
#          - /etc/prometheus/targets.d/node.yml
#    relabel_configs:
#      - source_labels: [__address__]
#        target_label: instance
#      - source_labels: [__address__]
#        target_label: __address__
#        replacement: '${1}:9100'
Uncomment this by carefully removing the hash (only the hash, none of the spaces) from the front of each of those lines. Leave the later sections commented out.
INDENTATION IS IMPORTANT. There need to be exactly two spaces before the dash in
- job_name: node, four spaces before file_sd_configs:, and everything needs to line up as it was before.
Now save this file, and edit
/etc/prometheus/targets.d/node.yml which looks like
this:
- targets:
    - monX.ws.nsrc.org
    - nodeX1.ws.nsrc.org
    - nodeX2.ws.nsrc.org
    - nodeX3.ws.nsrc.org
    - nodeX4.ws.nsrc.org
    - nodeX5.ws.nsrc.org
Change each “X” to your cluster number, then save the file.
Finally, tell prometheus to reload its configuration:
systemctl reload prometheus
Use reload, not restart. Reload is a signal to re-read the configuration; if the new configuration is invalid, prometheus will keep running with its old configuration. A restart, however, stops the daemon, and if the configuration is invalid it won't be able to start again.
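As an extra safety check, you can validate the configuration file syntax before (or after) reloading with promtool, which is normally installed alongside prometheus:
promtool check config /etc/prometheus/prometheus.yml
If the file is valid it reports success; otherwise the error message points at the offending line.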
Check the daemon status:
systemctl status prometheus
The last lines should include
msg="Completed loading of configuration file"
If you see
msg="Error reloading config"
then there was something wrong with your configuration; you may get more information by pressing the right cursor key to scroll sideways. Go back and edit the file to fix the problem, then try again. Ask for help if you need it.
If you are in a pager (lines 1-24/24 (END)) then press
‘q’ to quit.
Everyone can do this part.
Go to the Prometheus web interface. Check the “up” metric again:
up
Switch to Table view. You should now see multiple targets being scraped. Change the query to
up == 0
to check for any failures. The result should be empty; if it isn’t, you may get more info by going to the “Status > Targets” option in the menu bar.
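You can also count how many targets are up per scrape job, for example:
count by (job) (up == 1)
With the configuration above this should show 1 for the prometheus job and 6 for the node job (the monitoring server plus your five cluster nodes), assuming all targets are up.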
To get useful information, you need to build a PromQL query. Here is an example that you can paste in:
rate(node_disk_written_bytes_total[30s])
Enter this query, and you should see a table of results, which is the rate of disk I/O writes in bytes per second. Then click the “Graph” tab and you should see this in a graphical form, over the last hour.

You can alter the query to adjust the graph. For example, to show just devices whose names begin with “sd”:
rate(node_disk_written_bytes_total{device=~"sd.*"}[30s])
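You can also aggregate across devices; for example, this gives a single write rate per node rather than one series per disk:
sum by (instance) (rate(node_disk_written_bytes_total{device=~"sd.*"}[30s]))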
The PromQL browser is a useful place to develop graphing and alerting queries, but it does mean you need to know the node_exporter metric names and how to write PromQL queries.
For normal users, you would create a dashboard which has the PromQL queries built in. Grafana is a tool for this.
This initial configuration should be done by ONE person in the cluster
This person should go to the Grafana web interface, using the menu
option from the workshop server labelled “monX grafana”, or by entering
http://monX.ws.nsrc.org:3000/ in their browser.
Log in with username “admin” and password “admin”. This will prompt them to choose a new password. Change it to the normal class password (the one we’ve been using to login to Proxmox)
Next, the same person needs to create a data source to connect Grafana to Prometheus. Select the hamburger at top left to open up the menu bar if necessary, then open Connections and click on Data sources:

Add Prometheus as a new data source. Then, in its Settings page, set the server URL to:
http://localhost:9090

Scroll right down to the bottom, and click “Save & test”. If it shows “Successfully queried the Prometheus API” then you are good.
Now we are going to import a Grafana dashboard. Go back to the hamburger and click on Dashboards:

Next, click “New > Import”

Under “Find and import dashboards for common applications” enter the number “1860” and click Load.

1860 is the ID of a public dashboard called “Node Exporter Full” which was contributed to the grafana dashboards site
With some dashboards, it may say “Select a Prometheus data source”, in which case you need to select Prometheus from the drop-down. Otherwise, just go straight ahead and click Import.
You should now have a dashboard of node_exporter statistics!

Everyone in the cluster can now try this:
There is a third-party exporter, prometheus-pve-exporter, which connects to the Proxmox API, collects data, and returns it as prometheus metrics.
The exporter is running already on the lab monitoring server, but it needs to be given an authentication token before it is able to talk to the Proxmox API. You will then need to configure prometheus to scrape it.
One person in the cluster should do this part - the others should watch.
Login to the Proxmox web interface (any node).
Go to Datacenter > Permissions > Users and click on Add.
User name: prometheus
Password: NetManage (the actual password doesn’t matter but must be 8 or more chars)
Confirm password: NetManage
Next, go to Permissions (above “Users”), and click Add > User Permission
Path: /
User: prometheus@pve
Role: PVEAuditor (you’ll need to scroll a bit for this)
Next, go to Permissions > API Tokens, and click Add.
User: prometheus@pve
Token ID: monitoring
At this point you will be shown a Token ID and Secret. Copy the Secret and paste it somewhere safe (e.g. notepad). You will need this shortly. If you lose it, you’ll have to delete and recreate the API token.
Finally, go to Permissions (above “Users”) and click Add > API Token Permission
Path: /
API Token: prometheus@pve!monitoring
Role: PVEAuditor
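As an aside, the same user, role and API token could also be created from the command line on any Proxmox node using pveum. This is only a sketch of the equivalent commands (option names can vary slightly between Proxmox versions) and is not needed if you followed the GUI steps above:
# create the read-only user and grant it the PVEAuditor role
pveum user add prometheus@pve --password NetManage
pveum acl modify / --users prometheus@pve --roles PVEAuditor
# create the API token (privilege separation left at its default),
# then grant the token itself the PVEAuditor role
pveum user token add prometheus@pve monitoring
pveum acl modify / --tokens 'prometheus@pve!monitoring' --roles PVEAuditor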
Now you need to ssh into the monitoring box, and become root
(sudo -s)
Edit the file /etc/prometheus/pve.yml, and replace the
string after token_value: with the Secret you obtained
earlier:
default:
  user: prometheus@pve
  token_name: monitoring
  token_value: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
Restart the exporter to pick up this file:
systemctl restart pve_exporter
systemctl status pve_exporter
Check it’s running. If not, there may be an error in the config file.
Now you can test the exporter using curl. Note that you have to quote the URL because of the ampersand symbols, and replace X with your cluster number.
curl 'localhost:9221/pve?cluster=1&node=0&target=nodeX1.ws.nsrc.org'
This should return cluster metrics, like
pve_node_info and pve_version_info. If it
doesn’t, check all your configs or ask for help.
We now need to get prometheus to scrape data from the pve exporter.
Again, one person should do this section.
If you are not still there, make an ssh connection to the monitoring
server, and become root (sudo -s)
Edit the file /etc/prometheus/prometheus.yml, and
uncomment the section for pve:
#  - job_name: pve
#    metrics_path: /pve
#    file_sd_configs:
#      - files:
#          - /etc/prometheus/targets.d/pve.yml
#    relabel_configs:
#      - source_labels: [__address__]
#        target_label: instance
#      - source_labels: [__address__]
#        target_label: __param_target
#      - target_label: __address__
#        replacement: 'localhost:9221'
Again, be very careful just to remove one hash, and leave all the
indentation alone, so that for example there are two spaces before the
dash in the first line - job_name: pve
Edit the file /etc/prometheus/targets.d/pve.yml and
change all the X’s to your cluster number:
# Cluster metrics can be scraped from *any* node
- labels:
    __param_cluster: '1'
    __param_node: '0'
  targets:
    - nodeX1.ws.nsrc.org
# Node metrics (e.g. replication status) must be scraped from each node
- labels:
    __param_cluster: '0'
    __param_node: '1'
  targets:
    - nodeX1.ws.nsrc.org
    - nodeX2.ws.nsrc.org
    - nodeX3.ws.nsrc.org
    - nodeX4.ws.nsrc.org
    - nodeX5.ws.nsrc.org
As before, request prometheus to reload its configuration, and check that it was successful with no configuration errors:
systemctl reload prometheus
systemctl status prometheus
To check scraping, go into the Prometheus web interface, and enter this query:
pve_cluster_info
If it’s successful, you should see a metric like this (in “Table” view):
pve_cluster_info{id="cluster/clusterX", instance="nodeX1.ws.nsrc.org",
job="pve", nodes="5", quorate="1", version="5"} 1
If not, then check the up metric and “Status >
Targets” to see if the exporter is being scraped. Also look for errors
from the exporter itself with
journalctl -eu pve_exporter
Great, Prometheus is now collecting metrics from Proxmox VE!
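The exporter also returns per-guest metrics. For example, assuming your exporter version exposes a pve_up metric (check in Table view which metric names you actually have), this counts the guests that are currently running:
count(pve_up == 1)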
Import grafana dashboard number 10347 (using exactly the same process as you used to import the node_exporter dashboard)
Everyone can login to grafana and explore the dashboard. This gives you visibility at the level of individual VMs and containers.
The Ceph manager has a built-in capability to export prometheus metrics which is very good for alerting on things like OSD states.
From the monitoring node, you can test it like this:
curl nodeX1.ws.nsrc.org:9283/metrics
If it doesn’t work, then you need to turn the metrics on. One person should get a root shell onto nodeX1 (which is where the Ceph mgr is running) and type this:
ceph mgr module enable prometheus
Then check curl from the monitoring node again. It can take a minute or so for it to start working.
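If you want to confirm the module is enabled, and see which address the active manager is listening on, you can also run these from the same root shell on nodeX1:
ceph mgr module ls | grep prometheus
ceph mgr services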
Aside: once prometheus metrics are enabled, this also enables a Ceph command to see history of healthcheck alerts:
ceph healthcheck history ls # history of alerts
Again, one person should do this section.
Make an ssh connection to the monitoring server, and become root
(sudo -s)
Edit the file /etc/prometheus/prometheus.yml, and
uncomment the section for ceph:
#  - job_name: ceph
#    file_sd_configs:
#      - files:
#          - /etc/prometheus/targets.d/ceph.yml
#    relabel_configs:
#      - source_labels: [__address__]
#        target_label: instance
#      - source_labels: [__address__]
#        target_label: __address__
#        replacement: '${1}:9283'
Again, be very careful just to remove one hash, and leave all the
indentation alone, so that for example there are two spaces before the
dash in the first line - job_name: ceph
Edit the file /etc/prometheus/targets.d/ceph.yml and
change X to your cluster number:
# Scrape the active ceph manager node
- targets:
    - nodeX1.ws.nsrc.org
As before, request prometheus to reload its configuration, and check that it was successful with no configuration errors:
systemctl reload prometheus
systemctl status prometheus
To check scraping, go into the Prometheus web interface, and try these queries:
ceph_health_status
ceph_osd_up
If there is no result, then check the up metric and
“Status > Targets” to see if the cluster is being scraped.
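These metrics map naturally onto alert expressions. For example, both of the following should normally return nothing; any result from the first means an OSD is down, and any result from the second means the cluster is not HEALTH_OK (ceph_health_status is 0 for OK, 1 for WARN, 2 for ERR):
ceph_osd_up == 0
ceph_health_status != 0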
Import grafana dashboard 2842 (using the same process as before)
Everyone can login to grafana and explore the dashboard.
There are several dashboards available for Ceph, and each seems to have various limitations; you might want to try some of the other published dashboards or create your own.
The Linstor controller has built-in prometheus metrics on port 3370, which are enabled by default. You can test this from the monitoring server command line, with this command:
curl nodeX1.ws.nsrc.org:3370/metrics
The metrics are mainly limited to pool sizes and API response times.
More interesting metrics are available from DRBD. These are available
if you install the drbd-reactor package as described in this
blog post. This has been done for you already, which you can test
like this:
curl nodeXY.ws.nsrc.org:9942/metrics
(Test this with any node other than nodeX1 in your cluster)
The process is very similar to before. Edit
/etc/prometheus/prometheus.yml and uncomment these two
sections:
#  - job_name: linstor-controller
#    file_sd_configs:
#      - files:
#          - /etc/prometheus/targets.d/linstor-controller.yml
#    relabel_configs:
#      - source_labels: [__address__]
#        target_label: instance
#      - source_labels: [__address__]
#        target_label: __address__
#        replacement: '$1:3370'
#  - job_name: linstor-node
#    file_sd_configs:
#      - files:
#          - /etc/prometheus/targets.d/linstor-node.yml
#    relabel_configs:
#      - source_labels: [__address__]
#        target_label: instance
#      - source_labels: [__address__]
#        target_label: __address__
#        replacement: '$1:9942'
Edit /etc/prometheus/targets.d/linstor-controller.yml
and change ‘X’ to your cluster number (two places):
# The scrape job which polls linstor controller must also set a "node" target label,
# which forces the actual "node" labels that come back to be renamed by prometheus to "exported_node"
# (this is required by the official dashboard)
- labels:
    node: nodeX1
  targets:
    - nodeX1.ws.nsrc.org
Edit /etc/prometheus/targets.d/linstor-node.yml and
change ‘X’ to your cluster number, in two places for every node:
# The scrape job which polls drbd-reactor must be called "linstor-node",
# and it must also set a "node" target label for each, giving the node name
# by which the target is known to linstor.
# See https://forums.linbit.com/t/doc-note-deploying-linstor-with-proxmox/779/5
- labels:
    node: nodeX2
  targets:
    - nodeX2.ws.nsrc.org
- labels:
    node: nodeX3
  targets:
    - nodeX3.ws.nsrc.org
- labels:
    node: nodeX4
  targets:
    - nodeX4.ws.nsrc.org
- labels:
    node: nodeX5
  targets:
    - nodeX5.ws.nsrc.org
Reload prometheus and check:
systemctl reload prometheus
systemctl status prometheus
In the prometheus GUI, go to Status > Targets and check that the linstor-controller and linstor-node jobs are working.
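You can make the same check with a query; this should return one series per LINSTOR target, each with value 1:
up{job=~"linstor.*"}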
If all is good, go to Grafana and install dashboard 15917 (“LINSTOR / DRBD”). Set the time picker at the top right to “Last 1 hour”

There’s a lot more that hasn’t been covered here. There are more exercises and sample configs in the NSRC Network Monitoring and Management workshop.