How to perform disaster recovery

An etcd cluster automatically recovers from temporary failures of its members. It can also tolerate the permanent loss of up to (N-1)/2 members, where N is the cluster size. For example, the 3-member cluster used in this guide tolerates the permanent loss of 1 member; losing 2 members leaves it without quorum.

As stated by the official etcd docs:

Caution

If the cluster permanently loses more than (N-1)/2 members then it loses quorum and fails. Once quorum is lost, the cluster cannot reach consensus and therefore cannot perform write operations anymore.

This condition is called majority failure.

Detect majority failure

Charmed etcd regularly checks whether the cluster is in a healthy state. If it detects a majority failure, it updates the status of the deployed charmed etcd application. You can observe this with juju status:

Model  Controller      Cloud/Region         Version  SLA          Timestamp
etcd   dev-controller  localhost/localhost  3.6.5    unsupported  13:22:15Z

App   Version  Status   Scale  Charm         Channel  Rev  Exposed  Message
etcd           blocked      3  charmed-etcd  3.6/edge  52  no       Cluster failure - majority of cluster members lost. Run action `rebuild-cluster` to recover.

Unit      Workload  Agent  Machine  Public address  Ports  Message
etcd/0*   blocked   idle   0        10.136.16.45           Cluster failure - majority of cluster members lost. Run action `rebuild-cluster` to recover.
etcd/1    unknown   lost   1        10.136.16.204          agent lost, see 'juju show-status-log etcd/1'
etcd/2    unknown   lost   2        10.136.16.118          agent lost, see 'juju show-status-log etcd/2'

Machine  State    Address        Inst id         Base          AZ  Message
0        started  10.136.16.45   juju-53b717-0   [email protected]      Running
1        down     10.136.16.204  juju-53b717-1   [email protected]      Running
2        down     10.136.16.118  juju-53b717-2   [email protected]      Running

The failure is also reported in the Juju log, which you can view with juju debug-log:

unit-etcd-0: 13:23:23 WARNING unit.etcd/0.juju-log Cluster failed - no raft leader
unit-etcd-0: 13:23:23 ERROR unit.etcd/0.juju-log Cluster failure - majority of cluster members lost. Run action `rebuild-cluster` to recover.
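
If you have direct access to a surviving unit, you can also confirm the loss of quorum with the etcd client itself. The following is a minimal sketch only: it assumes etcdctl is available on the unit, and the endpoint URL and TLS certificate paths are placeholders that need to be adjusted to your deployment.

# Run on a surviving unit; endpoint and certificate paths are examples only.
etcdctl endpoint health \
  --endpoints=https://10.136.16.45:2379 \
  --cacert=<path-to-ca-cert> \
  --cert=<path-to-client-cert> \
  --key=<path-to-client-key>

With quorum lost, this check is expected to report the endpoint as unhealthy or time out.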

Recover from majority failure

Charmed etcd provides a mechanism to recover from majority failure. It is performed by running the rebuild-cluster action, as shown after the list below.

This will:

  • stop the cluster

  • keep existing data

  • re-initialise the cluster members

  • form a new cluster with the remaining, functional units and the existing data
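
You can confirm that the action is available on your deployment by listing the application's actions. This assumes the application is named etcd, as in this guide:

juju actions etcd

The output should list rebuild-cluster together with a short description.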

Caution

Before executing the action rebuild-cluster, make sure to remove the failed Juju units.

All units that are permanently lost, for example because their machine crashed, need to be removed first. In our example, these are the units etcd/1 and etcd/2. If the Juju controller can no longer reach them, they might need to be removed with the --force option.

Remove them by running:

juju remove-unit etcd/1 etcd/2 --force --no-wait

Caution

This is a potentially dangerous command: --force will remove a unit and its machine without a clean shutdown. It should only be executed if no other option exists.
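
If the controller can still reach the affected units, prefer a graceful removal first and only fall back to --force if it does not complete:

juju remove-unit etcd/1 etcd/2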

After removing all faulty units, juju status will still show the cluster in a state of majority failure:

Model  Controller      Cloud/Region         Version  SLA          Timestamp
etcd   dev-controller  localhost/localhost  3.6.5    unsupported  13:46:47Z

App   Version  Status   Scale  Charm         Channel  Rev  Exposed  Message
etcd           blocked      1  charmed-etcd  3.6/edge  52  no       Cluster failure - majority of cluster members lost. Run action `rebuild-cluster` to recover.

Unit      Workload  Agent  Machine  Public address  Ports  Message
etcd/0*   blocked   idle   0        10.136.16.45           Cluster failure - majority of cluster members lost. Run action `rebuild-cluster` to recover.

Machine  State    Address       Inst id         Base          AZ  Message
0        started  10.136.16.45  juju-53b717-0   [email protected]      Running

Now recover the cluster by running the action on the leader unit:

juju run etcd/leader rebuild-cluster

This produces the following output:

Running operation 48 with 1 task
  - task 49 on unit-etcd-0

Waiting for task 49...
result: cluster rebuild in progress
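
You can review the action result again later by referring to the task ID shown in the output above, for example:

juju show-task 49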

You can observe the recovery process with juju status again:

Model  Controller      Cloud/Region         Version  SLA          Timestamp
etcd   dev-controller  localhost/localhost  3.6.5    unsupported  13:49:12Z

App   Version  Status   Scale  Charm         Channel  Rev  Exposed  Message
etcd           blocked      1  charmed-etcd  3.6/edge  52  no       Rebuilding with new cluster configuration...

Unit      Workload  Agent       Machine  Public address  Ports  Message
etcd/0*   blocked   executing   0        10.136.16.45           Rebuilding with new cluster configuration...

Machine  State    Address       Inst id         Base          AZ  Message
0        started  10.136.16.45  juju-53b717-0   [email protected]      Running

Shortly after, once the process has finished, your charmed etcd application is recovered and functional again:

Model  Controller      Cloud/Region         Version  SLA          Timestamp
etcd   dev-controller  localhost/localhost  3.6.5    unsupported  13:51:46Z

App   Version  Status  Scale  Charm         Channel  Rev  Exposed  Message
etcd           active      1  charmed-etcd  3.6/edge  52  no       

Unit      Workload  Agent  Machine  Public address  Ports  Message
etcd/0*   active    idle   0        10.136.16.45           

Machine  State    Address       Inst id         Base          AZ  Message
0        started  10.136.16.45  juju-53b717-0   [email protected]      Running
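
The rebuilt cluster now runs with the single remaining unit. You will likely want to scale back to the original cluster size; a minimal example, assuming the application is named etcd and you want two additional units:

juju add-unit etcd -n 2

The new units should join the rebuilt cluster and restore its fault tolerance.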