Teach-The-Expert: vSAN Diskgroup Management on CLI

As part of my work as a trainer, I often come across questions on topics that are only covered in passing or not at all in the course. This series of articles provides trainee IT experts with tools for everyday use.

Intro – What are Diskgroups?

VMware vSAN OSA (Original Storage Architecture) structures the vSAN datastore into disk groups (DG). Each vSAN node can contain up to 5 disk groups. Each of these disk groups consists of exactly one cache device (SSD) and one to a maximum of 7 capacity devices. The capacity devices may be either magnetic disks or SSDs, but never a combination of the two. We differentiate between the cache tier and the capacity tier.
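
Disk group management on the CLI uses the same esxcli namespace that appears throughout this article. As a first, rough sketch (the device identifiers are placeholders for the naa.* names reported by esxcli vsan storage list), a new disk group is created by pairing a cache device with a capacity device; additional capacity devices can be claimed with further -d options:

esxcli vsan storage add -s <cache_device> -d <capacity_device>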

Disk groups can be managed using the graphical user interface (GUI). However, there are situations where disk group management on the command line interface (CLI) is necessary or more appropriate.

UUID

Each disk device of a vSAN cluster (OSA) has a universally unique identifier (UUID).

We can list all devices of a vSAN node on the CLI with this command:

esxcli vsan storage list

The sheer amount of information may be a bit overwhelming, so we display only the lines containing a UUID.

esxcli vsan storage list | grep UUID

We receive a list of all disk devices in the vSAN node, along with the UUID of the disk group to which each device is assigned.

If you take a closer look at the output, you will notice that some devices have a UUID identical to the UUID of their disk group. Is this a contradiction of the statement that the UUID is unique? No. These are cache devices. Each disk group in vSAN OSA consists of exactly one cache device, and the disk group adopts the UUID of its cache device. In this way, we can quickly distinguish a cache device from a capacity device.
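
If you want to script this comparison, a small awk one-liner will do. The sketch below assumes the field labels “VSAN UUID” and “VSAN Disk Group UUID” as they appear in the esxcli vsan storage list output of current OSA builds, and it prints the UUID of every cache device (i.e. every device whose own UUID equals its disk group UUID):

esxcli vsan storage list | awk -F': ' '/VSAN UUID:/ {u=$2} /VSAN Disk Group UUID:/ {if ($2 == u) print u}'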


Remove an unknown Disk from a vSAN Cluster

A failed capacity disk was removed from a vSAN 7 cluster before it could be logically detached from the disk group. The result was a leftover unknown disk device in the disk group that could not be removed in the vSphere Client.

In such cases, esxcli is sometimes the more powerful tool.

We need to connect by SSH to the affected vSAN host.

Collect Information

Let’s check all registered disk devices on the node.

esxcli vsan storage list

A detailed list of all cache and capacity devices of this host will be displayed.

Output of disk devices (shortened)

Among the 24 active disks was the unknown zombie device. Its only remaining identifying attribute was its vSAN UUID, which can be used to detach the device from the configuration.

Remains of a physically removed disk in the vSAN configuration.

Extraction

The UUID of the missing unknown device was “52b17786-183b-e85f-f7f3-4befb19f67b0”. Using this information, we can remove it from the configuration.

esxcli vsan storage remove --uuid 52b17786-183b-e85f-f7f3-4befb19f67b0

The process takes a few seconds. Checking again with the esxcli vsan storage list command confirmed that the device had been removed.
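
To double-check, grep the storage list for the old UUID once more; after a successful removal the command should return no output:

esxcli vsan storage list | grep 52b17786-183b-e85f-f7f3-4befb19f67b0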

Manage ESXi Coredump Files

Okay, I admit it, this is not a new topic, but it cost me some time in a client project. Since this blog also acts as a swap partition of my brain, I wrote it down for future reference. It is important to follow the steps correctly so that the changes are preserved after a reboot.

Why a Coredump File?

Modern ESXi installations starting with version 7 use a new partition layout on the boot device, and coredumps are located there as well. This only applies, however, when the boot medium is neither a USB flash device nor an SD card. In those cases the coredump is relocated to a VMFS datastore with at least 32 GB of capacity.

This is exactly the case I found in a customer environment. The system had been migrated from vSphere 6.7 and therefore still had the old boot layout on an (at that time still fully supported) SD-card RAID1. We found a vmkdump folder with files for each host on one of the shared VMFS datastores. This (VMFS5) datastore was supposed to be decommissioned and replaced with a VMFS6 datastore. (Side note from the VCI: there is no online migration path from VMFS5 to VMFS6.) 😉 So the vmkdump files had to be removed from there.
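
To get a quick overview of which hosts still keep dump files on the datastore that is about to be decommissioned, the vmkdump folder can simply be listed from any host that mounts it (the datastore name below is a placeholder):

ls -lh /vmfs/volumes/<datastore_name>/vmkdump/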

Procedure

First, we get an inventory of the coredump files.

esxcli system coredump file list

All coredump files of all ESXi hosts are listed here. Each line contains the path as well as the Active and Configured states (true or false). Active means that this is the current coredump file of this host; only the coredump file of the current host has active=true, while all other files belong to other hosts and are therefore active=false. It is important that Configured is also set to true, otherwise the setting will not survive a reboot.

By default, the host chooses the first matching VMFS datastore. This is not necessarily the desired one.
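
To see only what the host you are logged in to currently has active and configured, rather than the cluster-wide list, the get variant of the command is handy:

esxcli system coredump file get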

Remove the current Coredump File

First we delete the active coredump file of the host. We have to force the removal because it is set as active=true.

esxcli system coredump file remove --force

If we execute the list command from above again, there should be one line less.

Add a new Coredump File

The next command creates a new coredump file at the destination. If it does not already exist, a vmkdump folder is created on the datastore and the dumpfile is placed inside it. We specify the desired file name without an extension, because the extension (.dumpfile) is appended automatically.

esxcli system coredump file add -d <Name | UUID> -f <filename>

Example: the host is named “ESX-01” and the VMFS datastore is named “Service”. The datastore may be specified either by its display name or by its datastore UUID.

esxcli system coredump file add -d Service -f ESX-01

A vmkdump folder will be created on the designated datastore, containing a file named ESX-01.dumpfile. We can check this using the list command.

esxcli system coredump file list

A new line will appear with the full path to the new dumpfile. However, the status is still active=false and configured=false. It might be useful to copy this full path to the clipboard, because it is required in the next step.
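
If copying from the terminal is awkward, the full path of the freshly created file can also be fished out with a quick grep on the file name chosen above (ESX-01 in our example):

esxcli system coredump file list | grep ESX-01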

Activate Dumpfile

In the following step, we set the created dumpfile to active. This way, the setting is retained even after a host reboot. We specify the complete path to the dumpfile. The path copied to the clipboard earlier is helpful here and avoids typos.

esxcli system coredump file set -p <path_to_dumpfile>

Example:

esxcli system coredump file set -p /vmfs/volumes/<UUID>/vmkdump/ESX-01.dumpfile

A final list command validates the result.
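
Concretely, the following two commands should now show the new dumpfile with both active and configured set to true for this host:

esxcli system coredump file list
esxcli system coredump file get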

vSAN Cluster Live-Migration to new vCenter instance

What can be done if the production vCenter Server appliance is damaged and you need to migrate a vSAN cluster to a new vCenter appliance?

In this post, I will show how to migrate a running vSAN cluster from one vCenter instance to a new vCenter under full load.

Anyone who works with vSAN will have a sinking feeling in their guts thinking about this. Why would one do such a thing? Wouldn’t it be better to put the cluster into maintenance mode? – In theory, yes. In practice, however, we repeatedly encounter constraints that do not allow a maintenance window in the near future.

Normally, vCenter Server appliances are solid and low-maintenance units. Either they work, or they are completely destroyed. In the latter case, a new appliance could be deployed and a configuration restore applied from backup. None of this applied to a recent project. The VCSA 6.7 was still partially working, but key vSAN functionality was no longer operational in the UI. An initial idea to fix the problem with an upgrade to vCenter 7, and thus to a new appliance, proved unsuccessful. Cross-vCenter migration of VMs (XVM) to a new vSAN cluster was not possible either, firstly because this feature only became available with version 7.0 Update 1c, and secondly because only two new replacement hosts were available, which is too few for a new vSAN cluster. To make things worse, the source cluster was also at its capacity limit.

There was only one possible way out: stabilize the cluster and transfer it to a new vCenter under full load.

There is an old, but still valuable post by William Lam on this topic. With this, and the VMware KB 2151610 article, I was able to work out a strategy that I would like to briefly outline here.

The process actually works because, once set up and configured, a vSAN cluster can operate independently of vCenter. vCenter is only needed for monitoring and configuration changes.
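
This can be illustrated directly on any host: even with vCenter unreachable, each node still reports its cluster membership and health state on the CLI (the exact output fields vary slightly between versions):

esxcli vsan cluster get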
