This worksheet is a roadmap to some of the options for scaling prometheus.
A single prometheus node can scale quite large.
There are calculators available online which can estimate the RAM requirements for a single node. Note that additional RAM may be needed while queries are executing.
Experimentally, users have observed a single node with 12 cores and 64GB RAM ingesting 500,000 data points per second across 11 million timeseries. That is a lot of metrics!
Check for yourself how many metrics you are currently collecting. In the Prometheus web interface, go to Status > TSDB Status, and under "Head Stats" look at "Number of Series". This gives you the number of distinct timeseries which were active over approximately the last two hours.
Write down the value you see (number of series). With a 15-second scrape interval, your ingestion rate is roughly this number divided by 15. How far are you from ingesting 500,000 data points per second?
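This back-of-the-envelope arithmetic can be sketched in shell, using the example figures quoted above (substitute your own "Number of Series" value):

```shell
# Rough ingestion rate: active series divided by the scrape interval
# in seconds. 11 million series scraped every 15s:
series=11000000   # active timeseries (example figure)
interval=15       # scrape interval in seconds
echo $((series / interval))   # samples ingested per second
```

This prints 733333, i.e. roughly 730,000 samples per second for the example node.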
The simplest way to scale up is to have multiple prometheus servers - one per datacentre, per campus, per cloud region etc.
We already have our classroom set up this way - with one prometheus server per campus. Now we just need a way to get a global view of these servers.
One option is to configure Grafana to talk to multiple prometheus servers.
Go to your grafana instance at http://oob-srv1-campusX.ws.nsrc.org/grafana
Add a new Prometheus data source with the name campusY (where this is a different campus to yours) and the URL http://srv1-campusY.ws.nsrc.org/prometheus (for the same remote campus). Click "Save & test": it should come back green. If not, check your work, and check that the other campus has a working prometheus instance.
You can add all the other campuses if you wish.
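If you prefer configuration-as-code, Grafana can also provision data sources from a YAML file instead of clicking through the UI. A minimal sketch, assuming a stock Grafana install (the file path and campus URL are examples, not part of this exercise):

```yaml
# e.g. /etc/grafana/provisioning/datasources/campuses.yaml (path is an assumption)
apiVersion: 1
datasources:
  - name: campus2
    type: prometheus
    access: proxy
    url: http://srv1-campus2.ws.nsrc.org/prometheus
```

Grafana reads this directory at startup, so adding all six campuses is one file edit rather than six UI forms.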
Now you need to modify your dashboards to be able to select these additional data sources. This is quite involved the first time you do it.
Go to one of your dashboards: we suggest the “SNMP Traffic” one that you created before
On the top menu, select Dashboard Settings (Cog)
Click on “Variables”. You should see your existing variables, which may be something like this:
instance label_values(ifIndex,instance)
ifDescr label_values(ifIndex{instance="$instance"},ifDescr)

You are now going to create a new variable called source which selects which Prometheus server you are querying. Click on "New variable", then enter:

Name: source
Type: Data source
Data source type: Prometheus
You’re back to the list of variables
Your new source variable will be at the bottom. Drag
the “domino” control at the right to bring it to the top of the list, so
your variables look like this:
source prometheus
instance label_values(ifIndex,instance)
ifDescr label_values(ifIndex{instance="$instance"},ifDescr)

Click on the instance variable. Change its data source from Prometheus to ${source}. Repeat for the ifDescr variable, again changing its data source from Prometheus to ${source}.

Click "Close" at the top to go back to your dashboard.

For each of the widgets and graphs on the page: edit it, and change its data source from Prometheus to ${source}.

Click "Save dashboard" (floppy disk) at the top of the page. Add a note like "Change to use $source", and Save.
Your dashboard will now have a drop-down source selector. Choose the prometheus server at one of the other campuses, and browse their data!
As you can see, the problem with this approach is that you need to
modify all your dashboards to include a $source
selector - and this includes dashboards you may have imported from third
parties. It can be quicker to edit the JSON form of the dashboard rather
than editing every panel by hand.
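For example, after exporting a dashboard to JSON, a single substitution can replace every hard-coded datasource at once. A sketch (the filenames are assumptions, and ${source} is written literally into the JSON, not expanded by the shell):

```shell
# Replace every hard-coded "Prometheus" datasource in an exported
# dashboard JSON with the ${source} dashboard variable, then re-import
# the modified file into Grafana.
sed 's/"datasource": "Prometheus"/"datasource": "${source}"/g' \
    dashboard.json > dashboard-multi.json
```

The single quotes stop the shell from expanding ${source}, so the literal variable reference lands in the JSON for Grafana to resolve.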
Another option is to run a frontend called promxy in front of your prometheus servers. You send a query to promxy, and it sends it to the different prometheus backends and combines the query results. We will not do this in this exercise.
The advantage is that you can set up Grafana with a single prometheus data source (pointing to Promxy) and not have to configure multiple Prometheus backends or modify dashboards.
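For reference, a minimal promxy configuration might look like the following sketch (check the promxy documentation for the exact schema; the hostnames are this classroom's, and the port and path_prefix are assumptions based on how the campus servers proxy prometheus):

```yaml
# promxy config sketch: one server group covering two campus servers.
promxy:
  server_groups:
    - static_configs:
        - targets:
            - srv1-campus1.ws.nsrc.org:80
            - srv1-campus2.ws.nsrc.org:80
      path_prefix: /prometheus
```

Each server group is queried in parallel and the results are merged before being returned to Grafana.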
Prometheus has the ability to write to a remote database. This can be used to combine metrics from multiple prometheus servers into one central store, and to keep a longer-term archive than the local storage retains.
There are a number of existing integrations, and indeed a recent version of Prometheus can itself be configured as a receiver for remote writes, but this exercise is going to use one called VictoriaMetrics.
Every campus will configure their own prometheus server to write to a central VictoriaMetrics database running on the NOC, which has already been set up by the instructors.
VictoriaMetrics listens on port 8428 by default, and exposes a
prometheus-compatible API. On the NOC, path /vmetrics is
proxied to port 8428.
Run the following command on your srv1 instance to check that you can communicate with the remote VictoriaMetrics instance running on the NOC:
/opt/prometheus/promtool query instant http://admin:password123@noc.ws.nsrc.org/vmetrics up
If this is a fresh install it may return no results at all, but what’s important is that you don’t get an error.
If you try the query without the username and password, you should get a “401” (unauthorized) error.
/opt/prometheus/promtool query instant http://noc.ws.nsrc.org/vmetrics up
On your srv1, enter the prometheus container if you’re not already there:
incus shell prometheus
Edit your /etc/prometheus/prometheus.yml.
You will add an “external_labels” section under “global”. This is so
that all metrics written to VictoriaMetrics will have an extra label
like campus="campus1" to distinguish the metrics written
from the different campuses. You will also add a remote_write
section.
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    campus: campusX

# Archiving to VictoriaMetrics
remote_write:
  - url: http://noc.ws.nsrc.org/vmetrics/api/v1/write
    basic_auth:
      username: admin
      password: password123
    queue_config:
      max_samples_per_send: 10000
      max_shards: 30

... leave the rest of the file unchanged (from alertmanager configuration
... onwards)
Test your configuration:
/opt/prometheus/promtool check config /etc/prometheus/prometheus.yml
If this shows any errors, fix them. Ask for help if you need to.
When this is OK, tell prometheus to re-read its configuration, then do a final check for errors:
systemctl reload prometheus
journalctl -eu prometheus
Repeat your query to the remote VictoriaMetrics server:
/opt/prometheus/promtool query instant http://admin:password123@noc.ws.nsrc.org/vmetrics up
Within a couple of minutes you should see your campus’ metrics
appearing. These will have campus="campusX" as an
additional label. If there are too many to see, then filter them in the
query:
/opt/prometheus/promtool query instant http://admin:password123@noc.ws.nsrc.org/vmetrics 'up{campus="campusX"}'
Getting Grafana to talk to VictoriaMetrics is just the same as when you got Grafana to talk to a prometheus server in another campus.
Go to your grafana instance at http://oob-srv1-campusX.ws.nsrc.org/grafana
Add a new Prometheus data source with the name VictoriaMetrics, the URL http://noc.ws.nsrc.org/vmetrics, and Basic auth enabled with username admin and password password123. Click "Save & test": it should say "Successfully queried the Prometheus API" in green (if not, ask for help).
Go to your SNMP Traffic dashboard. Select “VictoriaMetrics” as the source from the dropdown, and you should be able to see all the merged data collected from the various campuses and stored in the central VictoriaMetrics database.
This makes it very easy to do queries which span multiple campuses; and you also can be sure that any expensive queries done here will not affect the scraping done by the remote prometheus servers. You could also use this central storage to keep a longer-term archive.
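For example, a cross-campus query might sum inbound traffic per campus. A sketch (it assumes the ifHCInOctets metric from the SNMP collection set up earlier in the workshop):

```
sum by (campus) (rate(ifHCInOctets{job="snmp"}[5m]) * 8)
```

The campus label added by external_labels is what makes this per-campus grouping possible.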
Other large-scale storage options worth looking at include Thanos, Cortex and Mimir.
Thanos can store unlimited volumes of data to cheap S3 cloud storage, and performs downsampling of data which makes queries which cover long time periods much faster. It normally runs as a “sidecar” to prometheus, reading prometheus data chunks directly and uploading them to S3, although it can also act as a remote write receiver. Thanos has several components, so we are not going to set it up here, but it has a straightforward design where the components can be deployed incrementally.
Cortex is designed for huge cloud-scale, multi-tenant installations. It is an open-source CNCF project; Grafana Labs were the biggest contributor.
Mimir is a fork of Cortex by Grafana themselves, with a different license.
Another way to centralise storage is with federation. In this approach, a central prometheus server scrapes the remote prometheus servers to collect data out of them. You can limit it to scraping only selected metrics. If you wish, you can configure a larger scrape interval, so that the central server stores data at lower resolution.
Ask your instructor to set up federation on noc.ws.nsrc.org to collect data from all the campuses. They will need to add a new scrape job to prometheus.yml:
- job_name: 'federate'
  scrape_interval: 2m
  honor_labels: true
  metrics_path: '/prometheus/federate'
  params:
    'match[]':
      - '{job="snmp"}'
      - '{job="node"}'
  static_configs:
    - targets:
        - 'srv1-campus1.ws.nsrc.org'
        - 'srv1-campus2.ws.nsrc.org'
        - 'srv1-campus3.ws.nsrc.org'
        - 'srv1-campus4.ws.nsrc.org'
        - 'srv1-campus5.ws.nsrc.org'
        - 'srv1-campus6.ws.nsrc.org'
When this is done, you should be able to access the web interface at http://noc.ws.nsrc.org/prometheus and perform queries - or add noc.ws.nsrc.org as another data source in your grafana dashboard.
Use up{job="federate"} as a query to check you’re seeing
data ingested using this federation job.
By default, prometheus stores data for 15 days. You can change
this by setting the configuration flag
--storage.tsdb.retention.time. This setting is global and
applies to all metrics.
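For example, to keep 90 days of data you would add the flag to prometheus' startup command. A sketch of the relevant line in a systemd service file (the binary path matches this workshop's install; the other flags and the unit file location are assumptions):

```
# Hypothetical excerpt from the prometheus systemd service unit:
ExecStart=/opt/prometheus/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.retention.time=90d
```

After editing a unit file you would run systemctl daemon-reload and restart prometheus for the new retention to take effect.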
However, prometheus’ database is not really designed for long-term storage. For long-term metric archival, you may be better off using a remote storage system such as VictoriaMetrics or Thanos.
To save storage and to speed up querying, you may also wish to store your long-term data at a lower resolution. This can be done, for example, by federating with a larger scrape interval (as described above), or by using a remote storage system such as Thanos which performs downsampling.
This is just information for reference.
For high availability in prometheus, simply run multiple prometheus servers scraping the same targets. You can use promxy in front of them to get a merged view: promxy will “fill in the gaps” where one server doesn’t have any data.
For high availability in alertmanager, you can run multiple alertmanagers in a cluster. You need to add flags to each alertmanager so they know about each other, and configure prometheus to talk to all alertmanagers.
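As an illustration, each node in a two-node alertmanager cluster might be started with flags like the following sketch (the hostnames are assumptions; 9094 is alertmanager's default cluster port):

```
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager2.ws.nsrc.org:9094
```

On the prometheus side, the alerting section would then list both alertmanagers as targets, so every alert reaches the whole cluster and the cluster deduplicates notifications.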
If you have separate prometheus servers in multiple campuses or data centres, you might want a separate alertmanager (or alertmanager cluster) in each campus or data centre. To get a global dashboard which shows you all the alertmanagers, you can install karma or alerta.