Recently we had to shut down an entire datacenter for maintenance. The company's main power supply had a scheduled five-hour maintenance window, and one of the requirements was that every single power consumer had to be shut off and disconnected.
So all VMs were shut down, hosts were put into maintenance mode and powered off, as were the storage arrays, fabrics, switches, climate control, and UPS units – everything.
Startup with obstacles
Although we had precise shutdown and power-on procedures, there’s always a risk that something unforeseen causes trouble. In our case one of eight ESXi hosts failed to boot. It hung at a blank screen, just before the first ESXi boot messages would appear. We ruled out hardware issues with the host itself and concluded that the cause had to be the flash medium from which ESXi boots.
Luckily we had a replacement medium with ESXi 6.0 already installed, but its configuration was blank. A configuration backup of this particular host didn’t exist (note to self: ALWAYS generate config backups, even when not touching ESXi installations!).
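For the record, such a backup takes only seconds. A minimal sketch – the hostname and destination path below are placeholders, and the exact download URL printed by `vim-cmd` varies per host:

```shell
# On the ESXi host (SSH or console shell):
# persist any pending config changes, then request a backup bundle.
vim-cmd hostsvc/firmware/sync_config
vim-cmd hostsvc/firmware/backup_config
# vim-cmd prints a download URL for a configBundle.tgz;
# replace the "*" in that URL with the host's name or IP and fetch the file.

# Alternatively, remotely via PowerCLI (placeholder names):
# Get-VMHostFirmware -VMHost esx01.example.com -BackupConfiguration -DestinationPath C:\Backups
```

The bundle can later be restored on a freshly installed host of the same version with `vim-cmd hostsvc/firmware/restore_config` (or `Set-VMHostFirmware -Restore` from PowerCLI).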
Normally this wouldn’t be a big issue: boot the host from the new medium, apply a host profile, and everything would be fine. Unfortunately the customer has heterogeneous hosts with special portgroups on certain hosts and no distributed vSwitches (I know, I know…).
Because all hosts were going into maintenance mode, we did not evacuate VMs from them – there was simply no host left to take them. That’s why all VMs on the failed host were inaccessible and greyed out in vCenter.
From my point of view this shouldn’t be a problem: seven functional hosts with access to the shared storage were left. Why can’t we just migrate the (powered-off) VMs to a host that is alive? Because you can’t, as long as the failed host is still registered in the cluster. The official (and painful) procedure is to remove the failed host from the cluster, which makes all of its VMs disappear, and then walk through the datastores and add every missing VM back to the inventory. That’s a PITA!
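At least the re-registration part can be scripted instead of clicking through the datastore browser. A sketch, assuming SSH access to one of the surviving hosts and a shared datastore named `datastore1` (both placeholders); `vim-cmd` will simply error out for VMs that are still registered elsewhere:

```shell
# On a surviving ESXi host: register every .vmx file found on the
# shared datastore into this host's inventory.
for vmx in /vmfs/volumes/datastore1/*/*.vmx; do
  vim-cmd solo/registervm "$vmx"   # prints the new VM id on success
done
```

The rough PowerCLI equivalent would be `New-VM -VMFilePath "[datastore1] myvm/myvm.vmx" -VMHost esx02.example.com`, run once per orphaned VM (again, names are placeholders).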
The removal of the host took ages. The progress bar stopped at 100% but did not finish for a very long time.
He’s dead, Jim
Why can’t we simply declare a host dead and let the other hosts take over its VMs? If a host crashes during production, HA does exactly that. In our case all VMs had been cleanly shut down beforehand, so this should be the easiest case imaginable.
During VMworld Europe I talked to some VMware designers and product managers and got an insight into cool future features of vCenter, VUM, the vSphere Client and others (sorry, no details here – NDA). I was asked for my opinion on these new features and designs, but I didn’t think of such a simple function that would make working with vSphere clusters much more comfortable.
Maybe I’ve missed an obvious feature. If that’s the case, please let me know. I will happily blog about it.
If someone at VMware who is in charge of vCenter or ESXi would get in touch with me, I’d be more than happy. Just drop me an email or contact me on Twitter.
I must say that I’m really impressed by VMware’s response to this issue. Soon after I published this post and tweeted about it, Dennis Lu, VMware product manager for the H5Client, got in touch with me on Twitter.
Missing @VMwarevSphere feature: how to declare a host dead. https://t.co/AoWKvissqf #blogtober #vexpert #VMTN @VMwareDesign
— Michael Schröder (@microlytix) October 15, 2017
Interesting, pretty odd situation with a hard recovery path (registering VMs). Can you DM me your email, I can loop in some other PMs too
— Dennis Lu (@dennisgoblu) October 16, 2017
2 Replies to “How to declare an ESXi host dead”
I just forwarded your issue and suggestion to the HA team. The problem of course is the following:
1. as you power off VMs manually, HA assumes there’s no need to restart
2. as the host hasn’t returned to the cluster, the cluster assumes it is still coming back.
I agree that an option to manually say “host X is not returning for duty” would be welcome. Potentially with two options: re-register VMs / re-register and restart VMs.
I see that HA isn’t responsible in this case. HA wasn’t enabled and all hosts were in maintenance mode, so a restart wasn’t expected. But re-registering the VMs on a different host would have saved our day.
I was surprised by the fact that the removal of the host took so long. A timeout, perhaps?
Thanks for forwarding this to the HA team. I’m curious.