Prometheus and Grafana lab

Infrastructure Monitoring lab

In this lab you’ll look at using Prometheus and Grafana to monitor your local Proxmox infrastructure, although they can also be used to monitor cloud services.

Everyone in the cluster will be working together. Whenever a change is required to the Prometheus or Grafana configuration, you’ll need to coordinate between all the groups in your cluster to decide who will make that change. However, everyone can use the web interfaces to browse data.

Don’t worry if you don’t get through this all! Treat it as reference material in case you want to deploy this another time.

Access the Prometheus web interface

There is a separate monitoring VM which lives outside your Proxmox cluster; its hostname is monX.ws.nsrc.org

Everyone in the cluster can access the Prometheus web interface from their laptop. Use the drop-down menu from the lab web interface.

This will take you to the URL http://monX.ws.nsrc.org:9090/ (replacing “X” with your cluster number). To keep the lab simple, there is no HTTPS and no authentication.

Inside the PromQL query box, enter “up” and then click on Execute.

You should see a table with a single metric:

up{instance="localhost:9090", job="prometheus"} 1

There is a single scrape job, which is prometheus scraping itself for metrics about its own state. The value is 1 which means the scrape was successful. You can also see scrape information if you select “Status > Targets” from the top menu.
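You can also filter on the labels shown in the result; for example, this query returns just the self-scrape job:

up{job="prometheus"}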

Let’s look at one of the measurements it’s collecting. Go back to the main page by clicking “Prometheus” at the top left corner, then enter the query “scrape_duration_seconds” and click Execute again.

You should see a number, which is the duration of the most recent scrape, in seconds.

Now click on the “Graph” tab.

You should see a graph showing the history of this metric over the last hour.

If you don’t see this, please ask an instructor for help.

Part 1: Collect host metrics from node_exporter

Prometheus node_exporter has already been installed on all the nodes in your cluster. You’re now going to test this, and get prometheus to collect data.

Testing node exporter

Everyone can do this part.

Open an SSH command-line connection to your monitoring server, using the lab web interface.

This should open a terminal window and give you a prompt like this:

sysadm@mon1:~$

Now use curl to send a test scrape to one of the nodes in your cluster:

curl nodeXY.ws.nsrc.org:9100/metrics

You can copy-paste this, but substitute “XY” with your cluster number and a node number, so that it names any node in your cluster.

You should see screenfuls of metrics come back, in Prometheus/OpenMetrics format, like these:

...
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 364
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0

Again, if you don’t see this, please ask for help.

Configuring prometheus

The prometheus configuration file is at /etc/prometheus/prometheus.yml and you can inspect it:

cat /etc/prometheus/prometheus.yml

Now, you should pick one person in your cluster to amend the configuration.

That person will need to change to root (sudo -s) and then use an editor to edit /etc/prometheus/prometheus.yml. Find the commented-out section which starts with job_name: node

#  - job_name: node
#    file_sd_configs:
#      - files:
#          - /etc/prometheus/targets.d/node.yml
#    relabel_configs:
#      - source_labels: [__address__]
#        target_label: instance
#      - source_labels: [__address__]
#        target_label: __address__
#        replacement: '${1}:9100'

Uncomment this by carefully removing the hash (only the hash, none of the spaces) from the front of each of those lines. Leave the later sections commented out.

INDENTATION IS IMPORTANT. There needs to be exactly two spaces before the dash in - job_name: node, four spaces before file_sd_configs:, and everything needs to line up as it was before.
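For reference, after removing the hashes the section should look exactly like this:

  - job_name: node
    file_sd_configs:
      - files:
          - /etc/prometheus/targets.d/node.yml
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
      - source_labels: [__address__]
        target_label: __address__
        replacement: '${1}:9100'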

Now save this file, and edit /etc/prometheus/targets.d/node.yml which looks like this:

- targets:
    - monX.ws.nsrc.org
    - nodeX1.ws.nsrc.org
    - nodeX2.ws.nsrc.org
    - nodeX3.ws.nsrc.org
    - nodeX4.ws.nsrc.org
    - nodeX5.ws.nsrc.org

Change each “X” to your cluster number, then save the file.

Finally, tell prometheus to reload its configuration:

systemctl reload prometheus

Use reload not restart. Reload is a signal to re-read the configuration; if it’s invalid, prometheus will keep running with its old configuration. But restart will stop the daemon, and if the configuration is invalid it won’t be able to start.
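Optionally, you can check the file for syntax errors before reloading. The promtool utility normally ships with the prometheus package; if it’s installed, run:

promtool check config /etc/prometheus/prometheus.yml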

Check the daemon status:

systemctl status prometheus

The last lines should include msg="Completed loading of configuration file"

If you see msg="Error reloading config" then there was something wrong with your configuration - you may get more information by pressing the right arrow key to scroll sideways. Go back and edit it and fix the problem, then try again. Ask for help if you need it.
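You can also read the full error message in the journal, which avoids having to scroll sideways:

journalctl -eu prometheus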

If you are in a pager (lines 1-24/24 (END)) then press ‘q’ to quit.

Examine node_exporter data

Everyone can do this part.

Go to the Prometheus web interface. Check the “up” metric again:

up

Switch to Table view. You should now see multiple targets being scraped. Change the query to

up == 0

to check for any failures. The result should be empty; if it isn’t, you may get more info by going to the “Status > Targets” option in the menu bar.

To get useful information, you need to build a PromQL query. Here is an example that you can paste in:

rate(node_disk_written_bytes_total[30s])

Enter this query, and you should see a table of results, which is the rate of disk I/O writes in bytes per second. Then click the “Graph” tab and you should see this in a graphical form, over the last hour.

You can alter the query to adjust the graph. For example, to show just devices whose names begin with “sd”:

rate(node_disk_written_bytes_total{device=~"sd.*"}[30s])
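If you want to experiment a bit more (this isn’t needed for the lab), a commonly used node_exporter query approximates CPU utilisation per host as a percentage:

100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100)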

Import grafana dashboard

The PromQL browser is a useful place to develop graphing and alerting queries, but it does mean you need to know the node_exporter metric names and how to write PromQL queries.

For normal users, you would create a dashboard which has the PromQL queries built in. Grafana is a tool for this.

This initial configuration should be done by ONE person in the cluster

This person should go to the Grafana web interface, using the menu option from the workshop server labelled “monX grafana”, or by entering http://monX.ws.nsrc.org:3000/ in their browser.

Log in with username “admin” and password “admin”. You will be prompted to choose a new password; change it to the normal class password (the one we’ve been using to log in to Proxmox).

Next, the same person needs to create a data source to connect Grafana to Prometheus. Select the hamburger at top left to open up the menu bar if necessary, then open Connections and click on Data sources. Add a new data source and choose Prometheus.

In the Settings page, set the Prometheus server URL to http://monX.ws.nsrc.org:9090/ (the same URL as the Prometheus web interface; http://localhost:9090/ should also work, since Prometheus runs on the same VM as Grafana).

Scroll right down to the bottom, and click “Save & test”. If it shows “Successfully queried the Prometheus API” then you are good.

Now we are going to import a Grafana dashboard. Go back to the hamburger and click on Dashboards.

Next, click “New > Import”

Under “Find and import dashboards for common applications” enter the number “1860” and click Load.

1860 is the ID of a public dashboard called “Node Exporter Full” which was contributed to the grafana dashboards site

With some dashboards, it may say “Select a Prometheus data source”, in which case you need to select Prometheus from the drop-down. Otherwise, just go straight ahead and click Import.

You should now have a dashboard of node_exporter statistics!

Explore grafana dashboard

Everyone in the cluster can now log in to Grafana and explore this dashboard.

Part 2: Collect Proxmox VE metrics

There is a third-party exporter, prometheus-pve-exporter, which connects to the Proxmox API, collects data and returns it as Prometheus metrics.

The exporter is running already on the lab monitoring server, but it needs to be given an authentication token before it is able to talk to the Proxmox API. You will then need to configure prometheus to scrape it.

Connect prometheus-pve-exporter to PVE API

One person in the cluster should do this part - the others should watch.

Login to the Proxmox web interface (any node).

Go to Datacenter > Permissions > Users and click on Add. The exporter configuration below expects a user named “prometheus” in the “pve” realm (i.e. prometheus@pve).

Next, go to Permissions (above “Users”), and click Add > User Permission

Next, go to Permissions > API Tokens, and click Add. The exporter configuration below expects the token to belong to prometheus@pve and to have the Token ID “monitoring”.

At this point you will be shown a Token ID and Secret. Copy the Secret and paste it somewhere safe (e.g. notepad). You will need this shortly. If you lose it, you’ll have to delete and recreate the API token.

Finally, go to Permissions (above “Users”) and click Add > API Token Permission

Now you need to ssh into the monitoring box, and become root (sudo -s)

Edit the file /etc/prometheus/pve.yml, and replace the string after token_value: with the Secret you obtained earlier:

default:
  user: prometheus@pve
  token_name: monitoring
  token_value: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX

Restart the exporter to pick up this file:

systemctl restart pve_exporter
systemctl status pve_exporter

Check it’s running. If not, there may be an error in the config file.

Now you can test the exporter using curl. Note that you have to quote the URL because of the ampersand symbols, and replace X with your cluster number.

curl 'localhost:9221/pve?cluster=1&node=0&target=nodeX1.ws.nsrc.org'

This should return cluster metrics, like pve_node_info and pve_version_info. If it doesn’t, check all your configs or ask for help.

Configure prometheus

We now need to get prometheus to scrape data from the pve exporter.

Again, one person should do this section.

If you are not still logged in there, make an ssh connection to the monitoring server, and become root (sudo -s).

Edit the file /etc/prometheus/prometheus.yml, and uncomment the section for pve:

#  - job_name: pve
#    metrics_path: /pve
#    file_sd_configs:
#      - files:
#          - /etc/prometheus/targets.d/pve.yml
#    relabel_configs:
#      - source_labels: [__address__]
#        target_label: instance
#      - source_labels: [__address__]
#        target_label: __param_target
#      - target_label: __address__
#        replacement: 'localhost:9221'

Again, be very careful just to remove one hash, and leave all the indentation alone, so that for example there are two spaces before the dash in the first line - job_name: pve

Edit the file /etc/prometheus/targets.d/pve.yml and change all the X’s to your cluster number:

# Cluster metrics can be scraped from *any* node
- labels:
    __param_cluster: '1'
    __param_node: '0'
  targets:
    - nodeX1.ws.nsrc.org

# Node metrics (e.g. replication status) must be scraped from each node
- labels:
    __param_cluster: '0'
    __param_node: '1'
  targets:
    - nodeX1.ws.nsrc.org
    - nodeX2.ws.nsrc.org
    - nodeX3.ws.nsrc.org
    - nodeX4.ws.nsrc.org
    - nodeX5.ws.nsrc.org

As before, request prometheus to reload its configuration, and check that it was successful with no configuration errors:

systemctl reload prometheus
systemctl status prometheus

To check scraping, go into the Prometheus web interface, and enter this query:

pve_cluster_info

If it’s successful, you should see a metric like this (in “Table” view):

pve_cluster_info{id="cluster/clusterX", instance="nodeX1.ws.nsrc.org",
                 job="pve", nodes="5", quorate="1", version="5"} 1

If not, then check the up metric and “Status > Targets” to see if the exporter is being scraped. Also look for errors from the exporter itself with

journalctl -eu pve_exporter

Import grafana dashboard

Great, Prometheus is now collecting metrics from Proxmox VE!

Import grafana dashboard number 10347 (using exactly the same process as you used to import the node_exporter dashboard)

Explore

Everyone can log in to grafana and explore the dashboard. This gives you visibility at the level of individual VMs and containers.
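If you prefer the raw metrics, you can also query them in the Prometheus web interface; for example, these show per-guest status and metadata (exact metric names may vary between prometheus-pve-exporter versions):

pve_up
pve_guest_info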

Part 3: Collect Ceph metrics

Enable metrics

The Ceph manager has a built-in capability to export prometheus metrics which is very good for alerting on things like OSD states.

From the monitoring node, you can test it like this:

curl nodeX1.ws.nsrc.org:9283/metrics

If it doesn’t work, then you need to turn the metrics on. One person should get a root shell onto nodeX1 (which is where the Ceph mgr is running) and type this:

ceph mgr module enable prometheus

Then check curl from the monitoring node again. It can take a minute or so for it to start working.
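If you want to double-check that the module is enabled, you can list the manager modules from the same root shell on nodeX1; the output should mention prometheus:

ceph mgr module ls | grep prometheus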

Aside: once prometheus metrics are enabled, this also enables a Ceph command to see history of healthcheck alerts:

ceph healthcheck history ls      # history of alerts

Configure prometheus

Again, one person should do this section.

Make an ssh connection to the monitoring server, and become root (sudo -s)

Edit the file /etc/prometheus/prometheus.yml, and uncomment the section for ceph:

#  - job_name: ceph
#    file_sd_configs:
#      - files:
#          - /etc/prometheus/targets.d/ceph.yml
#    relabel_configs:
#      - source_labels: [__address__]
#        target_label: instance
#      - source_labels: [__address__]
#        target_label: __address__
#        replacement: '${1}:9283'

Again, be very careful just to remove one hash, and leave all the indentation alone, so that for example there are two spaces before the dash in the first line - job_name: ceph

Edit the file /etc/prometheus/targets.d/ceph.yml and change X to your cluster number:

# Scrape the active ceph manager node
- targets:
    - nodeX1.ws.nsrc.org

As before, request prometheus to reload its configuration, and check that it was successful with no configuration errors:

systemctl reload prometheus
systemctl status prometheus

To check scraping, go into the Prometheus web interface, and try these queries:

ceph_health_status
ceph_osd_up

If there is no result, then check the up metric and “Status > Targets” to see if the cluster is being scraped.
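These metrics are a good basis for alerting, which this lab doesn’t cover. As a sketch only: if you created a rules file and referenced it from rule_files: in /etc/prometheus/prometheus.yml, a minimal alert on overall Ceph health might look like this:

groups:
  - name: ceph
    rules:
      - alert: CephHealthNotOK
        # ceph_health_status is 0 when the cluster reports HEALTH_OK
        expr: ceph_health_status != 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Ceph cluster health is not OK"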

Import grafana dashboard

Import grafana dashboard 2842 (using the same process as before)

Explore dashboard

Everyone can log in to grafana and explore the dashboard.

There are several dashboards available for Ceph, and each seems to have various limitations; you might want to try some of the other published dashboards or create your own.

Part 4: Collect Linstor and DRBD metrics

The Linstor controller has built-in prometheus metrics on port 3370, which are enabled by default. You can test this from the monitoring server command line, with this command:

curl nodeX1.ws.nsrc.org:3370/metrics

The metrics are mainly limited to pool sizes and API response times.
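For example, you can pick out just the pool-related metrics with a rough filter (exact metric names depend on the LINSTOR version):

curl -s nodeX1.ws.nsrc.org:3370/metrics | grep -i pool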

More interesting metrics are available from DRBD, if you install the drbd-reactor package as described in this blog post. This has been done for you already; you can test it like this:

curl nodeXY.ws.nsrc.org:9942/metrics

(Test this with any node other than nodeX1 in your cluster)

The process is very similar to before. Edit /etc/prometheus/prometheus.yml and uncomment these two sections:

#  - job_name: linstor-controller
#    file_sd_configs:
#      - files:
#          - /etc/prometheus/targets.d/linstor-controller.yml
#    relabel_configs:
#      - source_labels: [__address__]
#        target_label: instance
#      - source_labels: [__address__]
#        target_label: __address__
#        replacement: '$1:3370'

#  - job_name: linstor-node
#    file_sd_configs:
#      - files:
#          - /etc/prometheus/targets.d/linstor-node.yml
#    relabel_configs:
#      - source_labels: [__address__]
#        target_label: instance
#      - source_labels: [__address__]
#        target_label: __address__
#        replacement: '$1:9942'

Edit /etc/prometheus/targets.d/linstor-controller.yml and change ‘X’ to your cluster number (two places):

# The scrape job which polls linstor controller must also set a "node" target label,
# which forces the actual "node" labels that come back to be renamed by prometheus to "exported_node"
# (this is required by the official dashboard)
- labels:
    node: nodeX1
  targets:
    - nodeX1.ws.nsrc.org

Edit /etc/prometheus/targets.d/linstor-node.yml and change ‘X’ to your cluster number, in two places for every node:

# The scrape job which polls drbd-reactor must be called "linstor-node",
# and it must also set a "node" target label for each, giving the node name
# by which the target is known to linstor.
# See https://forums.linbit.com/t/doc-note-deploying-linstor-with-proxmox/779/5
- labels:
    node: nodeX2
  targets:
    - nodeX2.ws.nsrc.org

- labels:
    node: nodeX3
  targets:
    - nodeX3.ws.nsrc.org

- labels:
    node: nodeX4
  targets:
    - nodeX4.ws.nsrc.org

- labels:
    node: nodeX5
  targets:
    - nodeX5.ws.nsrc.org

Reload prometheus and check:

systemctl reload prometheus
systemctl status prometheus

In the prometheus GUI, go to Status > Targets and check that the linstor-controller and linstor-node jobs are working.
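You can also check from the PromQL query box, using the job names configured above:

up{job=~"linstor.*"}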

If all is good, go to Grafana and install dashboard 15917 (“LINSTOR / DRBD”). Set the time picker at the top right to “Last 1 hour”

Future work

There’s a lot more that hasn’t been covered here.

There are more exercises and sample configs in the NSRC Network Monitoring and Management workshop.