Prometheus

Network Monitoring and Management

Configuring Prometheus Data Store and Central Server

0.1 Introduction

Prometheus provides a data collection engine, a time series storage engine, and a basic web interface for querying. In this exercise we’ll install Prometheus.

Do this on your campus server instance (srv1.campusX.ws.nsrc.org)

1 Install (or upgrade) prometheus

(If prometheus is pre-installed and you are not upgrading, skip to the next section “Start prometheus”)

All the prometheus components are available as pre-built binaries.

Fetch and unpack the latest release from the Prometheus releases page (https://github.com/prometheus/prometheus/releases). Check that page first, then replace the “XX” and the “Y” in the commands below with the current Prometheus version number listed there (scroll down a bit to find the download link).

You should do these exercises as the root user on your srv1.campusX.ws.nsrc.org box. Start by connecting to your box, then:

$ sudo -s
# cd /root

Then do,

# wget https://github.com/prometheus/prometheus/releases/download/v2.XX.Y/prometheus-2.XX.Y.linux-amd64.tar.gz
# tar -C /opt -xvzf prometheus-2.XX.Y.linux-amd64.tar.gz
# cd /opt

This may take a few minutes depending on your network speed.
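
As a sketch of how the version substitution fits together, the snippet below builds the download URL from a single variable. VERSION, TARBALL and URL are illustrative names, not part of the exercise, and 2.20.1 is simply the version that appears in the sample output later in these notes; use whatever the releases page currently lists.

```shell
# Assumption: set VERSION to the release listed on the releases page.
# 2.20.1 is only the version that appears in the sample output in these notes.
VERSION="2.20.1"
TARBALL="prometheus-${VERSION}.linux-amd64.tar.gz"
URL="https://github.com/prometheus/prometheus/releases/download/v${VERSION}/${TARBALL}"

# Print the URL so you can check it before downloading.
echo "$URL"
```

You could then pass "$URL" to wget and "$TARBALL" to tar in place of the hand-edited names.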

Create a symbolic link from /opt/prometheus pointing to the version you downloaded (this makes it easier to upgrade and switch between versions). Remember to replace “XX” and “Y” with the version of prometheus you have downloaded.

If the prometheus link already exists and you are upgrading, remove it first:

# rm prometheus

Then,

# ln -s prometheus-2.XX.Y.linux-amd64 /opt/prometheus

The symbolic link allows us to create a systemd unit file without needing to update it each time we upgrade Prometheus.
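
To see why the link helps, here is a small self-contained sketch that plays out an upgrade in a scratch directory rather than in /opt. The directory names prometheus-old and prometheus-new are hypothetical stand-ins for two unpacked releases.

```shell
# Work in a throwaway directory so nothing under /opt is touched.
demo=$(mktemp -d)
mkdir "$demo/prometheus-old" "$demo/prometheus-new"

# Initial install: the link points at the old release.
ln -s prometheus-old "$demo/prometheus"

# Upgrade: remove the link and recreate it, as in the steps above.
rm "$demo/prometheus"
ln -s prometheus-new "$demo/prometheus"

# Anything that refers to $demo/prometheus now sees the new release.
target=$(readlink "$demo/prometheus")
echo "$target"

# Clean up the scratch directory.
rm -r "$demo"
```

The path that other files refer to never changes; only the link target does.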

Stop here if you are upgrading, and go to the “Start prometheus” section.

If you are installing Prometheus for the first time, then create a user for prometheus to run as, and a data directory:

# useradd --system -d /var/lib/prometheus prometheus
# mkdir /var/lib/prometheus
# chown prometheus:prometheus /var/lib/prometheus

Use a text editor to create a systemd unit file /etc/systemd/system/prometheus.service with the following contents:

[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target

[Service]
User=prometheus
Restart=on-failure
RestartSec=5
TimeoutStopSec=300
WorkingDirectory=/opt/prometheus
EnvironmentFile=/etc/default/prometheus
ExecStart=/opt/prometheus/prometheus $OPTIONS
ExecReload=/bin/kill -HUP $MAINPID

[Install]
WantedBy=multi-user.target

Tell systemd to read this new file:

# systemctl daemon-reload

Also create an options file /etc/default/prometheus with the following contents:

OPTIONS='--config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/var/lib/prometheus/data/ --storage.tsdb.wal-compression --web.external-url=http://srv1.campusX.ws.nsrc.org/prometheus'

(Adjust campusX as appropriate.) One consequence of the --web.external-url setting is that the prometheus API will respond under /prometheus instead of /, which allows us to proxy easily from the Apache web server.
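
The unit file passes $OPTIONS without braces, and systemd splits an unbraced variable at whitespace into separate arguments, so each flag in the single-quoted string reaches prometheus as its own argument. The sketch below imitates that splitting with ordinary shell word splitting. It is an illustration only, and nflags is a name made up for this example.

```shell
# Same value as in /etc/default/prometheus (campusX left as a placeholder).
OPTIONS='--config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/var/lib/prometheus/data/ --storage.tsdb.wal-compression --web.external-url=http://srv1.campusX.ws.nsrc.org/prometheus'

# Unquoted expansion splits the string at whitespace, one word per flag.
set -- $OPTIONS
nflags=$#   # nflags is a hypothetical name used only in this sketch

for flag in "$@"; do
    echo "$flag"
done
echo "number of flags: $nflags"
```

One practical consequence is that a flag value in this file cannot contain spaces.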

Create the initial default configuration:

# mkdir /etc/prometheus
# cp /opt/prometheus/prometheus.yml /etc/prometheus/

Edit /etc/prometheus/prometheus.yml and under job_name: 'prometheus' add:

    metrics_path: '/prometheus/metrics'

2 Start prometheus

Let’s start prometheus:

# systemctl enable prometheus   # start on future boots
# systemctl start prometheus    # start now
# journalctl -eu prometheus     # check for "Server is ready to receive web requests."

Use cursor keys to move around the journalctl log output, and “q” to quit. If there are any errors, then go back and fix them.

In the future you can type:

# systemctl status prometheus   # check for "Active: active (running)" text

3 The default configuration

Have a look at the default configuration file:

# cat /etc/prometheus/prometheus.yml

Note that the scrape interval is currently set to 15 seconds. This gives us rapid results, but in the real world, 1 minute is probably a safer starting point. (We’ll see how well the class runs with a 15 second scrape interval!)

It contains a single scrape job: it is scraping itself, to collect metrics about prometheus’ own internal operation.

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    metrics_path: '/prometheus/metrics'

    static_configs:
    - targets: ['localhost:9090']

Prometheus metrics are human-readable. Run the following command to scrape the metrics yourself:

# curl localhost:9090/prometheus/metrics

You should see several screenfuls of data, such as:

# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 18
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0

This is exactly what prometheus does every 15 seconds.

You don’t need to understand what each individual metric means, but each row represents a separate timeseries which will be collected into prometheus’ time series database for graphing, alerting etc.

promhttp_metric_handler_requests_total{code="200"} 18
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^  ^^
            METRIC NAME                  LABELS   VALUE
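
The three parts can be pulled apart with plain shell parameter expansion. This is a minimal sketch for the simple sample line above; it assumes a single pair of braces and no spaces inside label values, so it is not a general parser for the metrics format.

```shell
line='promhttp_metric_handler_requests_total{code="200"} 18'

name=${line%%\{*}      # everything before the first "{"
rest=${line#*\{}       # everything after the first "{"
labels=${rest%%\}*}    # everything up to the closing "}"
value=${line##* }      # everything after the last space

echo "name:   $name"
echo "labels: $labels"
echo "value:  $value"
```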

4 Access the web interface

In your web browser, go to https://oob.srv1.campusX.ws.nsrc.org/prometheus

If you are using our cloud-based training platform use this address instead:

https://oob.srv1.campus1.p.vtp-us.nsrc.org/prometheus

You should see the Prometheus basic query interface. This is a great place to do ad-hoc PromQL queries and graphs.

Browse to the Status menu and select Targets to see a summary of the targets being scraped - so far only Prometheus itself.

NOTE: for this workshop we have configured Apache to proxy /prometheus to the prometheus server running on port 9090, and prometheus to expect this prefix. This hides the prometheus port number from the web URLs. It also makes it possible to apply access controls and HTTPS encryption at the web proxy - functions that prometheus itself did not provide in the 2.20.1 release shown in these notes (built-in TLS and basic authentication arrived in later releases).

4.1 Counters

In the expression interface (click on “Graph” at the top of the page), start by typing “promhttp” - all the metrics which match this query will be shown.
Select “promhttp_metric_handler_requests_total” and then click “Execute”.

You should see a table of results like this:

Element Value
promhttp_metric_handler_requests_total{code="200",instance="localhost:9090",job="prometheus"} 49
promhttp_metric_handler_requests_total{code="500",instance="localhost:9090",job="prometheus"} 0
promhttp_metric_handler_requests_total{code="503",instance="localhost:9090",job="prometheus"} 0

These are the most recent values for those metrics, stored in the time series database.

Now click on the “Graph” tab just below the expression interface, and you should see these values over time. The requests for code="200" have most likely been increasing over time.

These are examples of “counter” metrics: they increment every time a particular type of event takes place.
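
Because a counter only ever goes up, the difference between two readings tells you how many events happened in between. As a rough worked example, take the two sample values in these notes (18 from the curl output and 49 from the table above) and assume every increment was one of Prometheus's own 15-second scrapes. Manual curl requests also bump this counter, so treat the result as an approximation.

```shell
earlier=18          # code="200" value in the curl sample
later=49            # code="200" value in the query table
scrape_interval=15  # seconds, from the default configuration

# Number of successful scrapes between the two readings.
scrapes=$(( later - earlier ))

# Approximate time between the readings, assuming one scrape per interval.
elapsed=$(( scrapes * scrape_interval ))

echo "$scrapes scrapes, roughly $elapsed seconds apart"
```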

4.2 Gauges

Now back in the query interface, enter “scrape_duration_seconds”. Look at the Console (Table tab) and Graph outputs.

This is a value which can go up and down over time, and is called a “gauge”. It represents the absolute value of something - in this case, the amount of time the last scrape job took.

There are other gauges, for example “process_resident_memory_bytes”.

Conventionally, the units for gauges are included at the end of the metric name, such as _seconds or _bytes.

4.3 Static series

Finally, enter “prometheus_build_info” as the query, and look at the Console output (the “Table” tab). You should see a result like this:

Element Value
prometheus_build_info{branch="HEAD",goversion="go1.14.6",instance="localhost:9090",job="prometheus",revision="983ebb4a513302315a8117932ab832815f85e3d2",version="2.20.1"} 1

This is an example of a static timeseries: the value is always 1. The interesting information is in the labels, in this case so you can see what version of prometheus you are running.

If you were to upgrade prometheus, then you’d get a new timeseries with a different set of labels, but the value of this new timeseries would still be “1”.

The graph view here isn’t very interesting - it’s just a flat line - but you can see when the timeseries starts and ends.

4.4 Status information

If you go to Status > Runtime & Build Information (menu at the top of the page), you will get some summary information about Prometheus operation.

If you would like to see the number of metrics that Prometheus is collecting about itself, click on Status > TSDB Status. The section labeled “Number of Series” is particularly interesting. You should see something like “789” listed. This means there are 789 distinct time series actively being written into the database. You are collecting 789 time series just about prometheus itself!

Fortunately the prometheus time series engine is very efficient, taking on average less than 2 bytes for each data point stored.
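
To put those figures together, here is a back-of-the-envelope upper bound on daily sample storage for Prometheus monitoring itself. It uses the 789 series and 2 bytes per data point quoted above, with the 15-second default interval and the 1-minute alternative. Your own series count will differ, index and WAL overhead are ignored, and daily_bytes is a helper name invented for this sketch.

```shell
series=789            # example "Number of Series" figure from above
bytes_per_sample=2    # upper bound quoted above
seconds_per_day=86400

# daily_bytes is a hypothetical helper: $1 = scrape interval in seconds.
daily_bytes() {
    echo $(( series * (seconds_per_day / $1) * bytes_per_sample ))
}

bytes_15s=$(daily_bytes 15)
bytes_60s=$(daily_bytes 60)

echo "15s interval: at most about $bytes_15s bytes per day"
echo "60s interval: at most about $bytes_60s bytes per day"
```

That works out to roughly 9 MB per day at 15 seconds and a little over 2 MB per day at 1 minute.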

5 Modifying the prometheus configuration (NOTE)

If you change the prometheus configuration - which you’ll do many times in these exercises - you should not restart prometheus. This is because prometheus can take a long time at start up to read in its Write-Ahead Log files (WAL); and also because if you make an error, you don’t want prometheus to stop running.

Instead, you will send it a signal to reload:

killall -HUP prometheus       # or: systemctl reload prometheus
journalctl -eu prometheus     # check the logs!

If the new config is bad, prometheus will log an error and keep running with the old configuration. If the new config is good, it will log success and start using the new configuration.

There is also a command you can use to validate the configuration before you try to load it:

/opt/prometheus/promtool check config /etc/prometheus/prometheus.yml
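
The two steps can be chained so that a reload is only attempted when the check passes. This is a sketch rather than part of the exercise: reload_if_valid is a hypothetical helper name, and PROMTOOL and CONFIG are variables introduced here that default to the paths used above.

```shell
# Assumed defaults, taken from the paths used in this exercise.
PROMTOOL=${PROMTOOL:-/opt/prometheus/promtool}
CONFIG=${CONFIG:-/etc/prometheus/prometheus.yml}

# reload_if_valid (hypothetical helper): validate first, reload only on success.
reload_if_valid() {
    if "$PROMTOOL" check config "$CONFIG"; then
        systemctl reload prometheus
    else
        echo "Config check failed; leaving the running configuration alone" >&2
        return 1
    fi
}
```

After editing the configuration, run reload_if_valid as root and then check the logs with journalctl as before.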

6 Command line queries

As well as the web interface, you can query prometheus using its REST API, and this can be done from the command line. Try for example:

/opt/prometheus/promtool query instant http://localhost:9090/prometheus prometheus_build_info
/opt/prometheus/promtool query instant http://localhost:9090/prometheus up

The first item gives you all the specific version information for the Prometheus you are currently running.

The second item tells you whether each scrape target is up (1) or down (0), and ends with a Unix timestamp of the current time and date. If you want to convert that on the command line, take the bit of output that looks something like @[1613883552.17], copy just the number, then issue the command:

# date -d @1613883552.17

Use the number you got, not the one listed above, as that’s from when these exercises were created.
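
If you would rather not copy the number by hand, the sketch below extracts it from the @[...] fragment with sed and formats it in UTC. It assumes GNU date (the -d option used above) and uses the sample timestamp from these notes; fragment, ts and when are illustrative names.

```shell
# Sample fragment from these notes; substitute the one from your own output.
fragment='@[1613883552.17]'

# Keep only the digits and dot between "@[" and "]".
ts=$(printf '%s\n' "$fragment" | sed 's/.*@\[\([0-9.]*\)].*/\1/')

# -u prints UTC so the result does not depend on the local time zone.
when=$(date -u -d "@$ts" '+%Y-%m-%d %H:%M:%S UTC')

echo "$ts -> $when"
```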

7 Further reading