Cluster Retreat Mode gone wrong – vSphere Client lockout

With the release of vSphere 7.0 Update 1, vSphere Cluster Services VMs (vCLS) appeared in vSphere clusters for the first time. They make cluster functions such as Distributed Resource Scheduler (DRS) independent of the availability of the vCenter Server Appliance (VCSA), which still represents a single point of failure in the cluster. Moving the DRS function to the redundant vCLS machines makes the cluster more resilient.

Retreat Mode

The vSphere administrator has little influence on the provisioning of these VMs. Occasionally, however, they have to be removed from a datastore, for example when the datastore is to be put into maintenance mode. For this purpose there is a procedure for putting the cluster into Retreat Mode. It involves setting temporary advanced settings that cause the cluster to delete the vCLS VMs.

According to the VMware procedure, the domain ID of the cluster must be determined to activate Retreat Mode. The domain ID is the numerical value between ‘domain-c’ and the following colon in the URL that the browser shows when the cluster is selected in the vSphere Client. In the example from my lab, it has the value 8, but the number can also have four digits or more.

The domain ID then has to be entered in the Advanced Settings of the vCenter.

config.vcls.clusters.domain-c8.enabled = false
Correct Retreat Mode settings.
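
Because the key is assembled by hand from a copied URL fragment, it can help to derive it mechanically. The following is a minimal sketch, assuming a POSIX shell with sed. The function names and the sample URL (host name and UID) are made up for illustration, and the URL layout is only assumed to resemble what the browser shows for a selected cluster.

```shell
#!/bin/sh
# Sketch only: function names and the sample URL are hypothetical.

# Print the digits between "domain-c" and the following colon.
extract_domain_id() {
    printf '%s\n' "$1" | sed -n 's/.*domain-c\([0-9][0-9]*\):.*/\1/p'
}

# Build the advanced setting key for a numeric domain ID.
retreat_mode_key() {
    printf 'config.vcls.clusters.domain-c%s.enabled\n' "$1"
}

# Assumed example URL; replace with the one from your own browser.
url='https://vcsa.example.local/ui/app/cluster;nav=h/urn:vmomi:ClusterComputeResource:domain-c8:00000000-0000-0000-0000-000000000000/summary'

id=$(extract_domain_id "$url")
# Only print a key if a numeric ID was actually found.
if [ -n "$id" ]; then
    retreat_mode_key "$id"
fi
```

The sed expression only captures digits, so a stray colon or slash from the URL cannot end up in the key.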

Admin error occurred during activation of Retreat Mode.

After Retreat Mode was activated on a vSAN cluster, the administrators lost all privileges on all objects in the vSphere Client.

A review of the services showed that the vCenter Server Daemon (vpxd) was not running.

What happened?

In this particular case, in addition to the domain ID, a colon was copied from the URL.

config.vcls.clusters.:domain-c8.enabled = false

This had the following effects:

  • vCenter Server daemon vmware-vpxd crashed and stopped working.
  • No objects were visible in the vSphere Client.
  • vCenter Server daemon vmware-vpxd refused to start.

What now?

To solve the problem, we need to access the shell of the VCSA and clean up the vpxd configuration. Change to the directory /etc/vmware-vpx and inspect the configuration file.

cd /etc/vmware-vpx
cat vpxd.cfg

When retreat mode is activated in the cluster for the first time, a vcls section (marked in red in the screenshot) is created in vpxd.cfg.

The image above shows our incorrect section with the flawed domain ID.

Now use the vi editor to delete all lines of the vcls section, including the opening and closing vcls tags. The easiest way to do this is in vi command mode: place the cursor on the line to be deleted and type ‘dd’, which deletes the entire line. Then save vpxd.cfg and exit vi with the :wq command (write and quit).

:wq
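
If you prefer not to delete the lines by hand, the same step can be sketched with sed. This is a minimal sketch, assuming the section starts on a line containing <vcls> and ends on a line containing </vcls>, which you should confirm with cat first. The function name strip_vcls_section and the backup path are hypothetical.

```shell
# Print a vpxd.cfg-style file without the lines from <vcls> to </vcls>, inclusive.
# The input file itself is not modified.
strip_vcls_section() {
    sed '/<vcls>/,/<\/vcls>/d' "$1"
}

# Possible usage on the VCSA (left as comments so nothing changes by accident):
#   cp /etc/vmware-vpx/vpxd.cfg /root/vpxd.cfg.bak
#   strip_vcls_section /root/vpxd.cfg.bak > /etc/vmware-vpx/vpxd.cfg
```

Reading from a backup copy and writing the result back keeps an untouched copy of the broken file in case the edit needs to be repeated.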

The modified vpxd.cfg file takes effect when the vmware-vpxd service is started.

service-control --start vmware-vpxd

As soon as the vcls section in vpxd.cfg has been cleared, the advanced setting config.vcls.clusters in the vSphere Client also disappears. We are back to the factory settings, as if the retreat mode had never been activated.

Closing remarks

I should add that the advanced parameters for the retreat mode are essentially very robust. I had some difficulty reproducing the error in my lab. Most of the invalid parameters were simply ignored.

Invalid input                 Result
Wrong domain ID               ignored
UID instead of domain ID      ignored
Forward slash after UID       ignored
Domain ID plus UID            ignored
Colon in front of domain-c    vpxd crash (possibly very delayed or only after restart)
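
Since a stray colon was the only input in my tests that brought vpxd down, a quick shape check of the key before pasting it into the Advanced Settings may be worthwhile. The following is a minimal sketch, assuming a shell with grep -E. The function name is_valid_retreat_key is hypothetical, and the check only looks at the form of the key, not at whether the domain ID exists.

```shell
# Succeed only for keys of the form config.vcls.clusters.domain-c<digits>.enabled
is_valid_retreat_key() {
    printf '%s\n' "$1" | grep -Eq '^config\.vcls\.clusters\.domain-c[0-9]+\.enabled$'
}

# Compare the correct key with the one containing the copied colon.
for key in 'config.vcls.clusters.domain-c8.enabled' \
           'config.vcls.clusters.:domain-c8.enabled'; do
    if is_valid_retreat_key "$key"; then
        echo "ok:       $key"
    else
        echo "rejected: $key"
    fi
done
```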
