We are going to simulate a number of failure situations, and recover from them.

Try and replicate the scenarios on your hosts.

1 Scenario 1: Loss of a Slave Node

1.1 Part 1: Loss of network connectivity

1.1.1 Description

Node H2, which is currently the primary node for your debianX instance, suddenly becomes unreachable. We will simulate this by powering H2 off, fail the instance over to its secondary node (H1), and then bring H2 back into the cluster.

1.1.2 Process

On the cluster master (H1), check where the instance is running:

# gnt-instance list -o name,pnode,snodes,status
Instance Primary_node      Secondary_Nodes   Status
debianX  H2.ws.nsrc.org    H1.ws.nsrc.org    running

Now log in to H2 and power it off to simulate the failure:

# halt -p

Back on the master (H1), check the instance status again:

# gnt-instance list -o name,pnode,snodes,status
Instance Primary_node      Secondary_Nodes   Status
debianX  H2.ws.nsrc.org    H1.ws.nsrc.org    ERROR_nodedown

As you will notice, things are now quite slow. Let's start by marking H2 as offline:

# gnt-node modify -O yes H2.ws.nsrc.org

Modified node H2.ws.nsrc.org
 - master_candidate -> False
 - offline -> True

It will take a little while, but now most commands will run faster as Ganeti stops trying to contact the other nodes in the cluster.

Try running gnt-instance list and gnt-node list again.

Also re-run gnt-cluster verify
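
For reference, run these on the master node (H1); with H2 marked offline they should now return promptly, and gnt-cluster verify should simply note the offline node instead of waiting for timeouts:

# gnt-instance list -o name,pnode,snodes,status
# gnt-node list
# gnt-cluster verify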

1.1.3 Recovery

If you attempt to migrate, you will be told:

# gnt-instance migrate debianX

Failure: prerequisites not met for this operation:
error type: wrong_state, error details:
Can't migrate, please use failover: Error 7: Failed connect to 10.10.0.X:1811; No route to host
# gnt-instance failover debianX

Hopefully you will see messages ending with:

...
Sat Jan 18 15:58:11 2014 * activating the instance's disks on target node H1.ws.nsrc.org
Sat Jan 18 15:58:11 2014  - WARNING: Could not prepare block device disk/0 on node H2.ws.nsrc.org (is_primary=False, pass=1): Node is marked offline
Sat Jan 18 15:58:11 2014 * starting the instance on the target node H1.ws.nsrc.org

If so, skip to the section "Confirm that the VM is now up on H1"

If you see this message:

Sat Jan 18 20:57:55 2014 Failover instance debianX
Sat Jan 18 20:57:55 2014 * checking disk consistency between source and target
Failure: command execution error:
Disk 0 is degraded on target node, aborting failover

... you will need to force the operation. From the Ganeti documentation:

If you are trying to migrate instances off a dead node, this will fail. Use the --ignore-consistency option for this purpose. Note that this option can be dangerous as errors in shutting down the instance will be ignored, resulting in possibly having the instance running on two machines in parallel (on disconnected DRBD drives).

# gnt-instance failover --ignore-consistency debianX

There will be much more output this time. Pay particular attention to any warnings - these are normal, since the H2 node is down and we did mark it as offline.

Sat Jan 18 21:03:15 2014 Failover instance debianX
Sat Jan 18 21:03:15 2014 * checking disk consistency between source and target

[ ... messages ... ]

Sat Jan 18 21:03:27 2014 * activating the instance's disks on target node H1.ws.nsrc.org

[ ... messages ... ]

Sat Jan 18 21:03:33 2014 * starting the instance on the target node H1.ws.nsrc.org
# gnt-instance list -o name,pnode,snodes,status

Instance Primary_node      Secondary_Nodes   Status
debianX  H1.ws.nsrc.org    H2.ws.nsrc.org    running

1.1.4 Re-adding the failed node

Ok, let's say H2 has been fixed.

We need to re-add it to the cluster. We do this using the gnt-node add --readd command on the cluster master node.

From the gnt-node man page:

In case you're readding a node after hardware failure, you can use the --readd parameter. In this case, you don't need to pass the secondary IP again, it will be reused from the cluster. Also, the drained and offline flags of the node will be cleared before re-adding it.

# gnt-node add --readd H2.ws.nsrc.org

[ ... question about SSH ...]

Sat Jan 18 22:09:43 2014  - INFO: Readding a node, the offline/drained flags were reset
Sat Jan 18 22:09:43 2014  - INFO: Node will be a master candidate

We're good! It could take a while to re-sync the DRBD data if a lot of disk activity (writing) has taken place on debianX, but this will happen in the background.
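
If you are curious, you can watch the resync at the DRBD level on either node (this assumes DRBD 8.x, which exposes its status via /proc/drbd):

# watch -n 5 cat /proc/drbd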

Check the cluster:

# gnt-cluster verify

The DRBD disks for the instance have probably not been re-activated on H2. If so, you can use gnt-cluster verify-disks to fix this:

# gnt-cluster verify-disks
# gnt-cluster verify

When all is OK, let's try and migrate debianX back to H2:

# gnt-instance migrate debianX

Test that the migration has worked.
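
For example, the primary and secondary nodes should have swapped back, with the instance still running:

# gnt-instance list -o name,pnode,snodes,status
Instance Primary_node      Secondary_Nodes   Status
debianX  H2.ws.nsrc.org    H1.ws.nsrc.org    running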

Note: if you are certain that the node H2 is healthy (let's say it was just a power failure, and no corruption has happened on its filesystem or disks), you could simply do the following (DON'T DO THIS NOW!):

# gnt-node modify -O no H2.ws.nsrc.org

Sat Jan 18 22:08:45 2014  - INFO: Auto-promoting node to master candidate
Sat Jan 18 22:08:45 2014  - WARNING: Transitioning node from offline to online state without using re-add. Please make sure the node is healthy!

But you would be warned about this.

1.2 Alternate decisions

1.2.1 Completely removing H2 from the cluster

If we were certain that H2 cannot be fixed, and won't be back online, we could delete H2 from the cluster. To do this:

# gnt-node remove H2.ws.nsrc.org

Failure: prerequisites not met for this operation:
error type: wrong_input, error details:
Instance debianX is still running on the node, please remove first

Ok, we are not allowed to remove H2, because Ganeti can see that we still have an instance (debianX) associated with it.

This is different from simply marking the node offline, as it means we are permanently getting rid of H2, and we need to take a decision about what to do for DRBD instances that were associated with H2.

So what do we do now? If we had a third node (H3), we could use gnt-node evacuate. Read the man page for gnt-node and look for the section about the evacuate subcommand.

gnt-node evacuate is used to move all DRBD instances from one failed node to others.
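
For illustration only (our lab cluster has no third node), an evacuation using the standard hail iallocator to pick the destination nodes might look roughly like this; the exact options vary between Ganeti versions, so check man gnt-node first:

# gnt-node evacuate -I hail H2.ws.nsrc.org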

If the cluster has only two nodes, we first need to temporarily convert our instance debianX from the DRBD disk template to plain. Unfortunately, this requires shutting down the VM instance.

# gnt-instance shutdown debianX

Wait until it is down (you will see some WARNINGs again), then:

# gnt-instance modify -t plain -n H1.ws.nsrc.org debianX

Sat Jan 18 21:40:54 2014 Converting template to plain
Sat Jan 18 21:40:54 2014 Removing volumes on the secondary node...
Sat Jan 18 21:40:54 2014 Removing unneeded volumes on the primary node...
Modified instance debianX
 - disk_template -> plain

(WARNINGs removed in the output above)

We should now be able to remove the node:

# gnt-node remove H2.ws.nsrc.org

More WARNINGs! But did it work?

# gnt-node list

Node              DTotal DFree MTotal MNode MFree Pinst Sinst
H1.ws.nsrc.org  29.1G 12.6G   995M  145M  672M     2     0

Yes, H2 is gone.

Note: Ganeti will modify /etc/hosts on your remaining nodes, and remove the line for H2!
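
You can check this on H1:

# grep -i h2 /etc/hosts

which should no longer show an entry for H2.ws.nsrc.org.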

By the way, we can now restart our debianX instance:

# gnt-instance start debianX

Test that it comes up normally.
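
For example, check that it is listed as running and that its console responds:

# gnt-instance list -o name,pnode,status
# gnt-instance console debianX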

2 Scenario 2: Loss of Master Node

Let's imagine a slightly more critical scenario: the crash of the master node.

Let's shut down the master node!

On H1:

# halt -p

The node is now down. VMs still running on the other nodes are unaffected, but you cannot make any changes to the cluster (stop, start, modify or add VMs, change the cluster configuration, and so on).
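
You can see this for yourself: log in to one of the surviving nodes (for example H2) and try any management command:

# gnt-instance list

It will fail, since the command needs to reach the master daemon on H1, which is unreachable (the exact error message depends on your Ganeti version).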

2.1 Promoting a slave

Let's assume that H1 is not coming back right now, and we need to promote a master.

You will first need to decide which of the remaining nodes will become the master.

Read about master-failover: man gnt-cluster, find the MASTER-FAILOVER section.

To promote the slave, log in to the node that will become the new master (in our case H2) and run:

# gnt-cluster master-failover

Note here that you will NOT be asked to confirm the operation!

Note that if you were running a 2-node configuration, you may have to add the --no-voting option: with the other node down, there is no remaining node to vote in the master election anyway.
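
In that case the command would be (only use this when a vote is genuinely impossible, as it bypasses a safety check):

# gnt-cluster master-failover --no-voting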

At this point, the chosen node (H2) is now master. You can verify this using the gnt-cluster getmaster command.
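
For example:

# gnt-cluster getmaster
H2.ws.nsrc.org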

From this point, recovering downed machines is similar to what we did in the first scenario. But to be on the safe side, power H1 back on, and once it has rebooted, log in to it and try a management command, for example gnt-instance list.

Normally, even though H1 was down while the promotion of H2 happened, the ganeti-masterd daemon running on H1 will find out, on startup, that H1 is no longer master. The command should therefore fail with:

This is not the master node, please connect to node 'H2.ws.nsrc.org' and
rerun the command

This means that H1 is well aware that H2 is the master now.

Once you have done this, you may find that H2 and H1 have different versions of the cluster database. Type the following on H2:

# gnt-cluster verify
...
Sat Jan 18 16:11:12 2014   - ERROR: cluster: File /var/lib/ganeti/config.data found with 2 different checksums (variant 1 on H2.ws.nsrc.org, H3.ws.nsrc.org; variant 2 on H1.ws.nsrc.org)
Sat Jan 18 16:11:12 2014   - ERROR: cluster: File /var/lib/ganeti/ssconf_master_node found with 2 different checksums (variant 1 on H2.ws.nsrc.org, H3.ws.nsrc.org; variant 2 on H1.ws.nsrc.org)

You can fix this by:

# gnt-cluster redist-conf

which pushes out the config from the current master to all the other nodes.

Re-run gnt-cluster verify to check everything is OK again.

Then, to make H1 take over the master role again, log in to H1 and run:

# gnt-cluster master-failover
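
Afterwards, check that the master role has moved back:

# gnt-cluster getmaster
H1.ws.nsrc.org

Finish with a final gnt-cluster verify to confirm the cluster is healthy.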