Solaris Cluster: Recovering from Amnesia
Amnesia Scenario
Node node-1 is shut down.
Node node-2 crashes and will not boot due to hardware failure.
Node node-1 is rebooted but stops and prints out the messages:
Booting as part of a cluster
NOTICE: CMM: Node node-1 (nodeid = 1) with votecount = 1 added.
NOTICE: CMM: Node node-2 (nodeid = 2) with votecount = 1 added.
NOTICE: CMM: Quorum device 1 (/dev/did/rdsk/d4s2) added; votecount = 1, bitmask of nodes with configured paths = 0x3.
NOTICE: CMM: Node node-1: attempting to join cluster.
NOTICE: CMM: Quorum device 1 (gdevname /dev/did/rdsk/d4s2) can not be acquired by the current cluster members. This quorum device is held by node 2.
NOTICE: CMM: Cluster doesn't have operational quorum yet; waiting for quorum.
Node node-1 cannot boot completely because it cannot achieve the needed quorum vote count.
NOTE: With Oracle Solaris Cluster 3.2 update 1 and above on Solaris 10, the boot continues in non-cluster mode after a timeout.
In the above case, node node-1 cannot start the cluster due to the amnesia protection of Oracle Solaris Cluster. Since node node-1 was not a member of the cluster when it was shut down (when node-2 crashed) there is a possibility it has an outdated CCR and should not be allowed to automatically start up the cluster on its own.
The general rule is that a node can only start the cluster if it was part of the cluster when the cluster was last shut down. In a multi-node cluster it is possible for more than one node to be among "the last" to leave the cluster.
Solution
If this is a cluster with three or more nodes, start with a node that is suitable for starting the cluster, e.g. a node connected to the majority of the storage. In this example node-1 represents this first node.
1. Stop node-1 and boot it in non-cluster mode. (Single-user mode is not necessary, only faster.)
ok boot -sx
2. Make a backup of the /etc/cluster/ccr/infrastructure file or /etc/cluster/ccr/global/infrastructure file, depending upon the cluster and patch revisions noted in UPDATE_NOTE #1 below.
# cd /etc/cluster/ccr
# /usr/bin/cp infrastructure infrastructure.old
Or if UPDATE_NOTE #1 applies
# cd /etc/cluster/ccr/global
# /usr/bin/cp infrastructure infrastructure.old
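If it is unclear whether UPDATE_NOTE #1 applies, the installed cluster release and core patch level can be checked first; a minimal sketch, assuming the standard command locations (scinstall -pv prints release and package versions; showrev -p lists patches on Solaris 9/10 only):
# /usr/cluster/bin/scinstall -pv
# showrev -p | grep 12610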
3. Get this node's node ID.
# cat /etc/cluster/nodeid
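For the example scenario the node that is up is node-1, so the expected (illustrative) output is simply its numeric ID:
# cat /etc/cluster/nodeid
1
This ID selects which cluster.nodes.<id> entries to edit in the next step.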
4. Edit the /etc/cluster/ccr/infrastructure file or /etc/cluster/ccr/global/infrastructure file, depending upon the cluster and patch revisions noted in UPDATE_NOTE #1 below.
Change the quorum_vote to 1 for the node that is up (node-1, nodeid = 1).
cluster.nodes.1.name node-1
cluster.nodes.1.state enabled
cluster.nodes.1.properties.quorum_vote 1
For all other nodes and any quorum device, set the vote count to zero (see the example snippet after this list).
Other nodes, where # is any node ID except the one edited above:
cluster.nodes.#.properties.quorum_vote 0
Quorum device(s), where # is the quorum device ID:
cluster.quorum_devices.#.properties.votecount 0
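As a rough sketch for the two-node example above (node-1 = nodeid 1, node-2 = nodeid 2, one quorum device), the edited entries would end up looking like this; the node-2 and quorum-device lines shown here are illustrative, and the real file contains many more entries, all of which are left untouched:
cluster.nodes.1.name node-1
cluster.nodes.1.properties.quorum_vote 1
cluster.nodes.2.name node-2
cluster.nodes.2.properties.quorum_vote 0
cluster.quorum_devices.1.properties.votecount 0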
5. Regenerate the checksum of the infrastructure file by running:
# /usr/cluster/lib/sc/ccradm -i /etc/cluster/ccr/infrastructure -o
Or if UPDATE_NOTE #1 applies
# /usr/cluster/lib/sc/ccradm -i /etc/cluster/ccr/global/infrastructure -o
NOTE: If running SC 4.X, SC 3.3, SC 3.2u3, or SC 3.2 with a cluster core patch equal to or greater than 126105-36 (5.9), 126106-36 (5.10), or 126107-36 (5.10 x86), the ccradm command would be
# /usr/cluster/lib/sc/ccradm recover -o infrastructure
6. Boot node node-1 into the cluster.
# /usr/sbin/reboot
7. The cluster is now started, so as long as the other nodes have been repaired they can be booted up and join the cluster again. When these nodes join the cluster their vote counts will be reset to the original values, and if a node is connected to any quorum device the quorum device's vote count will also be reset.
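Once node-1 is back in the cluster, quorum and membership can be verified before and after booting the remaining nodes; a minimal check, assuming the standard command locations (clquorum is available on SC 3.2 and later, scstat on 3.x releases):
# /usr/cluster/bin/clquorum status
# /usr/cluster/bin/scstat -q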
UPDATE_NOTE #1: If running Solaris Cluster 3.2u2 or higher, the directory path /etc/cluster/ccr is replaced with /etc/cluster/ccr/global. The same applies if running a cluster core patch equal to or greater than 126105-27 (5.9), 126106-27 (5.10), or 126107-27 (5.10 x86).
Reference Doc : Doc ID 1018806.1