Your Proxmox clusters have Ceph already configured. In this lab you’re going to use it to store your VM images.
Some steps can be done via the GUI, and some via the Proxmox host’s CLI shell. We give both here; one person in the group can do one, and another the other.
Click “Datacenter (clusterX)” in the left column, and “Ceph” in the next column along.

You may see a health warning:
1 pools have too many placement groups
Click “+” next to this for more details:
Pool vmpool has 128 placement groups, should have 32
(This is a recommendation from Ceph, based on how full the pool currently is; you will see it whenever the calculated “ideal” value differs from the current value by more than a factor of three. In this case, the pool is empty.)
Look also at the states of the OSDs: are they “up” or “down”, and “in” or “out”? What about the monitors and managers?
The metadata server (MDS) won’t be running: it is part of CephFS, the file store (as opposed to the RBD block store), and it hasn’t been enabled.
Under the “Performance” section, can you see the total storage size?
The cluster should have been built with three storage servers (host7-9) each having 4 disks, and each disk is 8GiB. Does that agree?
Note that this can be misleading, because vmpool stores three copies of each block. In that case, how much usable space do we have?
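As a back-of-the-envelope check (assuming the layout described above: 3 servers, 4 OSDs each, 8 GiB per OSD, and 3-way replication), the arithmetic can be sketched in the shell:

```shell
# Raw capacity: 3 storage servers x 4 OSDs each x 8 GiB per OSD
echo $((3 * 4 * 8))        # GiB of raw storage

# Usable capacity: with 3-way replication, every block is stored 3 times
echo $((3 * 4 * 8 / 3))    # GiB of usable storage
```

Compare these figures against what the “Performance” section reports.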
Now select “Storage” in the second column. You’ll get a list of Proxmox storage resources, one of which is “vmpool”, of type RBD. Underneath this is a Ceph pool also called “vmpool”.
To get Ceph’s view of the pools, select any host in the left column, “Ceph” in the second column, and under that “Pools”.

You should see some basic info (including utilization) of each pool.
The Ceph command-line tools must be run on one of the hosts which is running as a Ceph monitor; these are host1-3 in your cluster (you can see them in the GUI on the main Ceph status page, in the section headed “Services”). Pick any one of these.
To get to the command line, open a shell on your host by selecting the host in the left column, and >_ Shell in the next column.
Try these commands:
ceph health # current alerts
ceph healthcheck history ls # history of alerts
pveceph pool ls # list all pools
pveceph pool get vmpool # details about this pool
ceph osd pool autoscale-status # more info about PG scaling calculations
ceph osd status # status of all OSDs
ceph osd perf # show OSD latencies (may all be zero right now)
(Note that Proxmox provides some of its own admin tools under pveceph, but in many cases you have to use the underlying ceph tool.)
To get a rolling health check which refreshes automatically when something changes, type ceph -w. After its initial output it will wait for a change. Hit Ctrl-C to exit.
For this part, find the VM which you created previously from the Ubuntu cloud image; it may be called something like “groupX-web”.
(It doesn’t matter whether it’s started or not: Proxmox allows storage conversion to happen “live”.)

Select the VM’s “Hardware” panel, click on its hard disk, and use the “Disk Action” button to choose “Move Storage”. You will then get an action dialog: select “vmpool” as the target storage.
Then click “Move disk”. You should see progress as the disk contents are copied, hopefully ending with TASK OK.
NOTE: If the migration hangs, it could be because your Ceph cluster has filled up! Check the Ceph summary page to see if this has happened. The instructor will use this as a discussion point.
If conversion completes successfully, you should then remove the old disk which will appear as “Unused Disk 0” in the “Hardware” section of your VM.
What’s different? Since the VM is now on shared storage, you can live-migrate it to another host without having to copy the disk, as it’s equally accessible from anywhere in the cluster.
If you look carefully, you’ll see that migration still copies the cloud-init disk. The Disk Action button is greyed out for this device, so you can’t move it to Ceph, but as it’s very small that doesn’t matter.
Now look at how the cluster status has changed as you and the other groups have moved data into Ceph.
Go back to the Datacenter > Ceph summary page, and look at “Usage”. Has it increased?
Select any node in your cluster, “Ceph” in the second panel, and under that “OSDs”

This will show you the status of all the OSDs, and the percentage of storage utilization of each one. Note how the OSDs are not equally utilized. (Larger clusters will tend to balance much better than this)
Check what volumes (disk images) exist in this pool.
From the GUI: in the left column, under Datacenter, under your cluster host, select the storage item “vmpool (clusterX-hostY)”, then click on “VM Disks” in the second column.

On the right, can you see the disks that you and people in other groups have created?
From the CLI:
ceph df # show total pool sizes
rbd ls -l -p vmpool # which block volumes have been allocated in the pool?
rbd du -p vmpool # show provisioned size and allocated size
Much lower-level data can be found too, for example:
ceph pg ls-by-pool vmpool # shows exactly which sets of disks each PG is held on
Re-check ceph osd perf and see if you now have latency figures for the OSDs. This table can show you if one disk is performing significantly worse than the others; such a disk may degrade overall cluster performance, and/or may be about to fail.
This section is for information only.
The Proxmox web interface shows summary usage information for Ceph pools:

In the above example, it shows “10.06 GB of 30.23 GB” used. But understanding this figure is not straightforward, for reasons explained below.
To understand where Proxmox gets its data, you can use lower level tools:
# rbd du -p vmpool
NAME           PROVISIONED  USED
vm-100-disk-0  8 GiB        2.2 GiB
vm-101-disk-0  8 GiB        2.1 GiB
vm-102-disk-0  8 GiB        3.1 GiB
vm-103-disk-0  8 GiB        2.3 GiB
vm-111-disk-0  8 GiB        0 B
&lt;TOTAL&gt;        40 GiB       9.7 GiB
# ceph df
--- RAW STORAGE ---
CLASS   SIZE    AVAIL   USED    RAW USED  %RAW USED
hdd     96 GiB  64 GiB  32 GiB  32 GiB    32.99
TOTAL   96 GiB  64 GiB  32 GiB  32 GiB    32.99

--- POOLS ---
POOL    ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
vmpool   1  128  9.4 GiB  2.50k    28 GiB   33.28  19 GiB
.mgr     2    1  705 KiB  2        2.1 MiB  0      19 GiB
The first command shows that 5 VM disks have been created, of size 8 GiB each, making 40 GiB which might be required. But of that space, only 9.7 GiB has been written, because of thin provisioning.
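The “9.7 GiB” total can be verified by summing the USED column of the sample output above, for example with a small awk pipeline (the table is reproduced here as input so the command is self-contained):

```shell
# Sum the USED column (skip the header; stop before the <TOTAL> line)
awk 'NR > 1 && $1 != "<TOTAL>" { sum += $(NF-1) } END { printf "%.1f\n", sum }' <<'EOF'
NAME           PROVISIONED  USED
vm-100-disk-0  8 GiB        2.2 GiB
vm-101-disk-0  8 GiB        2.1 GiB
vm-102-disk-0  8 GiB        3.1 GiB
vm-103-disk-0  8 GiB        2.3 GiB
vm-111-disk-0  8 GiB        0 B
EOF
```

On a live cluster you would pipe `rbd du -p vmpool` straight into the awk command instead of the here-document.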
The second shows that vmpool has 9.4 GiB of stored data (ignore the small discrepancy), but this uses 28 GiB of disk space. This is due to replication.
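As a quick sanity check of the replication arithmetic (using the 9.4 GiB figure from the sample output above and the pool’s replication factor of 3):

```shell
# Stored data x replication factor ~= raw space used by the pool
awk 'BEGIN { printf "%.1f\n", 9.4 * 3 }'   # GiB, close to the 28 GiB reported
```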
It also says there is 19 GiB of space remaining in the pool.
Where does that figure come from? We can get more detail from the JSON format output:
# ceph df detail -f json | python3 -mjson.tool
{
    ...
    "pools": [
        {
            "name": "vmpool",
            "id": 1,
            "stats": {
                "stored": 10062511104,              <<<<
                "stored_data": 10062511104,
                "stored_omap": 0,
                "objects": 2500,
                "kb_used": 29480136,
                "bytes_used": 30187659264,
                "data_bytes_used": 30187659264,
                "omap_bytes_used": 0,
                "percent_used": 0.3328298032283783, <<<<
                "max_avail": 20170776576,           <<<<
                "quota_objects": 0,
                "quota_bytes": 0,
                "dirty": 0,
                "rd": 42605,
                "rd_bytes": 970658816,
                "wr": 14748,
                "wr_bytes": 10320263168,
                "compress_bytes_used": 0,
                "compress_under_bytes": 0,
                "stored_raw": 30187532288,          <<<<
                "avail_raw": 60512331611            <<<<
            }
        },
    ...
“avail_raw” says there are 60,512,331,611 bytes unused in total. But because our pool has a replication factor of 3, the “max_avail” is 20,170,776,576 bytes; this is an estimate of how much more data we could store.
This agrees with the “19 GiB” summary figure shown before, when you convert it to GiB:
20170776576 / 1024 / 1024 / 1024 = 18.79 GiB
Similarly, “stored” says that there are 10,062,511,104 bytes of data stored (which is 9.37 GiB). But “stored_raw” is 3 times larger, because of replication.
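These conversions can be checked directly in the shell, using the figures from the JSON output above:

```shell
# max_avail in bytes -> GiB
awk 'BEGIN { printf "%.2f\n", 20170776576 / 1024 / 1024 / 1024 }'

# stored in bytes -> GiB
awk 'BEGIN { printf "%.2f\n", 10062511104 / 1024 / 1024 / 1024 }'

# stored_raw / stored ~= the pool's replication factor
awk 'BEGIN { printf "%.2f\n", 30187532288 / 10062511104 }'
```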
Proxmox shows the pool’s “percent_used” figure, and it also estimates the total size of the pool by adding the “stored” and “max_avail” figures:
stored + max_avail
= 10062511104 + 20170776576
= 30233287680
= 30.23 GB
The fact that the total size has to be calculated in this way means that it may vary slightly over time, as the green section of the graph above shows, due to underlying Ceph overheads. It could also vary for other reasons, e.g. if other pools share the same OSDs, or if you apply or modify a pool quota.
Note that even the “percent_used” figure can be misleading, because we have in fact already overcommitted the storage: if all five VMs were to write to all 8 GiB of their attached disks, the total data storage requirement would be 40 GiB, which is more than the pool can hold. Ceph would block further writes when it fills, and the VMs would hang.
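The overcommit can be seen by comparing the provisioned capacity against the pool size calculated above (figures from the example outputs):

```shell
# Provisioned capacity if every VM disk filled up: 5 disks x 8 GiB
echo $((5 * 8))                                                      # GiB provisioned

# versus the pool's estimated total size: (stored + max_avail) in GB
awk 'BEGIN { printf "%.2f\n", (10062511104 + 20170776576) / 1e9 }'   # GB available
```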