What can be done if the production vCenter Server appliance is damaged and you need to migrate a vSAN cluster to a new vCenter appliance?
In this post, I will show how to migrate a running vSAN cluster from one vCenter instance to a new vCenter under full load.
Anyone who works with vSAN will have a sinking feeling in their guts thinking about this. Why would one do such a thing? Wouldn’t it be better to put the cluster into maintenance mode? – In theory, yes. In practice, however, we repeatedly encounter constraints that do not allow a maintenance window in the near future.
Normally, vCenter Server appliances are solid and low-maintenance units. Either they work, or they are completely destroyed. In the latter case, a new appliance could be deployed and a configuration restore could be applied from the backup. None of this applied to a recent project. VCSA 6.7 was still working halfway, but key vSAN functionality was no longer operational in the UI. An initial idea to fix the problem with an upgrade to vCenter v7 and thus to a new appliance proved unsuccessful. Cross-vCenter migration of VMs (XVM) to a new vSAN cluster was also not possible, firstly because this feature was only available starting with version 7.0 update 1c, and secondly because only two new replacement hosts were available. Too few for a new vSAN cluster. To make things worse, the source cluster was also at its capacity limit.
There was only one possible way out: stabilize the cluster and transfer it to a new vCenter under full load.
There is an old, but still valuable post by William Lam on this topic. With this, and the VMware KB 2151610 article, I was able to work out a strategy that I would like to briefly outline here.
The process actually works because, once set up and configured, a vSAN cluster can operate autonomously from the vCenter. The vCenter is only needed for purposes of monitoring and configuration changes.
With vSphere7 fundamental changes in the structure of the ESXi boot medium were introduced. A fixed partition structure had to give way to a more flexible partitioning. More about this later.
With vSphere 7 Update 3 VMware also brought bad news for those using USB or SDCard flash media as boot devices. Increasing read and write activity led to rapid aging and failure of these types of media, as they were never designed to handle such a heavy load profile. VMware put these media on the red list and the vSphere Client throws warning messages in case such a media is still in use. We will explore how to replace USB or SDCard boot media.
ESXi Boot Medium: Past and Present
In the past, up to version 6.x, the boot medium was rather static. Once the boot process was complete, the medium was no longer important. At most, there was an occasional read request from a VM to the VM Tools directory. Even a medium that broke during operation did not affect the ESXi host. Only a reboot caused problems. For example, it was still possible to backup the current ESXi configuration even if the boot medium was damaged.
Layout of the boot media up to ESXi 6.7
In principle, the structure was nearly always the same: A boot loader of 4 MB size (FAT16), followed by two boot banks of 250 MB each. These contain the compressed kernel modules, which are unpacked and loaded into RAM at system boot. A second boot bank allows a rollback in case of a failed update. This is followed by a “Diagnostic Partition” of 110 MB for small coredumps in case of a PSOD. The Locker or Store partition contains e.g. ISO images with VM tools for all supported guest OS. From here VM tools are mounted into the guest VM. A common source of errors during the tools installation was a damaged or lost locker directory.
The subsequent partitions differ depending on the size and type of the boot media. The second diagnostic partition of 2.5 GB was only created if the boot medium is at least 3.4 GB (4MB + 250MB + 250MB + 110MB + 286MB = 900MB). Together with the 2.5 GB of the second diagnostic partition, this requires 3.4 GB.
A 4 GB scratch partition was created only on media with at least 8.5 GB. It contains information for VMware support. Anything above that was provisioned as VMFS data store. However, scratch and VMFS partition were created only if the media was not USB flash or SDCard storage. In this case, the scratch partition was created in the host’s RAM. With the consequence that in the event of a host crash, all information valuable for support was lost as well.
Structure of the boot media from ESXi 7 onwards
The layout outlined above made it difficult to use large modules or third-party modules. Hence, the design of the boot medium had to be changed fundamentally.
First, the boot partition was increased from 4 MB to 100 MB. The two boot banks were also increased to at least 500 MB. The size is flexible, depending on the total size of the medium. The two diagnostic partitions (Small Core Dump and Large Core Dump), as well as Locker and Scratch have been merged into a common ESX-OSData partition with flexible size between 2.9 GB and 128 GB. Remaining space can be optionally provisioned as VMFS-6 datastore.
There are four different boot media size classes in vSphere 7:
4 GB – 10 GB
10 GB – 32 GB
32 GB – 128 GB
> 128 GB
The partition sizes shown above are for freshly installed boot media on ESXi 7.0, but what about boot media migrated from version 6.7?
Backing up and restoring an ESXi host configuration is a standard procedure that can be used when performing maintenance on the host. Not only host name, IP address and passwords are backed up, but also NIC and vSwitch configuration, Object ID and many other properties. Even after a complete reinstallation of a host, it can recover all the properties of the original installation.
Recently I wanted to reformat the bootdisk of a host in my homelab and had to fresh install ESXi for this. The reboot with the clean installation worked fine and the host got a new IP via DHCP.
Now the original configuration was to be restored via PowerCLI. To do this, first put the host into maintenance mode.
The command prompts for a root login and then automatically reboots. At the end of the boot process, an empty DCUI was welcoming me.
I haven’t seen this before. I was able to log in (with the original password), but all network connections were gone. The management network configuration was also not available for selection (grayed out). The host was both blind and deaf.
Strict security policies are in place in many corporate environments. This means that it is only possible to access internet resources to a limited extent, if at all. This becomes apparent, for example, when trying to install PowerCLI on a management system. While the availability of PowerCLI modules in the PowerShell Gallery provides an easy way to install or update PowerCLI, this is only possible if access to this external resource is allowed by Powershell. Using the Powershell Gallery requires the NuGet Packet Management Provider. This must also be obtained online.
If the Internet connection is restricted or blocked, the above command fails. But you can also transfer the modules offline. For this you need a PC with free internet access. Here you use a different command, which does not install the modules, but only downloads them to a defined path.
Copy the entire contents of the PSModules folder to a storage medium of your choice (e.g. USB flash drive) and transfer the files to the desired offline system where PowerCLI is needed.
If you have admin rights on the target system, you can copy files to the loaction below.
Now the PowerCLI modules are also available on the offline system. For a version update the procedure must be repeated. It is advisable to remove the VMware modules before transferring the current ones.
The VMware Customer Experience Improvement Program collects data about the use of VMware products. You can either agree (true) or disagree (false). For offline systems, only the rejection (false) makes sense. The command shown below suppresses future notifications within PowerCLI.