I had the special pleasure of working on a book project as co-author in the past months. It is entitled “VMware vSphere 7 – Das umfassende Handbuch” (“VMware vSphere 7 – The Compendium”, published in German language) and will be published in November by Rheinwerk-Verlag. It is the 6th updated and extended edition of this series.
This book covers a wide range of vSphere 7. From basic architecture to setup and day-2 operations. It helps novice and advanced IT administrators understand the principles of vSphere, network virtualization with NSX-T, vSAN, container workloads, VMware Cloud Foundation, Hybrid Cloud, and SDDC.
My contributions are the completely new written chapters Monitoring and vSAN. The chapter Monitoring is about giving the administrator an overview of the integrated monitoring tools and how to use and interpret them. It also introduces VMware and third-party tools. The vSAN chapter explains the fundamental structure of this storage virtualization and explains the special features of a vSAN cluster in comparison to conventional storage solutions.
After a failed firmware update on my Intel x722 NICs one host came up without its 10 Gbit kernelports (vSAN Network). Every effort of recovery failed and I had to send in my “bricked” host to Supermicro. Normally this shouldn’t be a big issue in a 4-node cluster. But the fact that management interfaces were up and vSAN interfaces were not must have caused some “disturbance” on the cluster and all my VM objects were marked as “invalid” on the 3 remaining hosts.
I was busy on projects so I didn’t have much lab-time anyway, so I waited for the repair of the 4th host. Last week it finally arrived and I instantly assembled boot media, cache and capacity disks. I checked MAC addresses and settings on the repaired host and everything looked good. But after booting the reunited cluster still all objects were marked invalid.
Time for troubleshooting
First I opened SSH shells to each host. There’s a quick powerCLI one-liner to enable SSH throughout the cluster. Too bad I didn’t have a functional vCenter at that time, so I had to activate SSH on each host with the host client.
From the shell of the repaired host I’ve checked the vSAN-Network connection to all other vSAN kernel ports . The command below pings from interface vmk1 (vSAN) to IP 10.0.100.11 (vSAN kernel port of esx01 for example)
vmkping -I vmk1 10.0.100.11
I received ping responses from all hosts on all vSAN kernel ports. So I could conclude there’s no connection issue in the vSAN-network.
Patch build 16324942 for ESXi 7.0 has been released on June 23rd 2020. It will raise ESXi 7.0 GA to ESXi 7.0b. As usual I’m patching my homelab systems ASAP. As all hosts are fully compliant with HCL, I chose a fully automated cluster remediation by vSphere Lifecycle Manager (vLCM).
7.0 GA build 15843807 (before) / 7.0b build 16324942 (after)
During host reboot I realized a temperature warning LED on the chassis. A look into IPMI revealed a critical CPU temperature state. Also the fans responsible for CPU airflow ran at maximum speed.
As you can see, system temperature was moderate and fans usually run at low to medium speed under these conditions. Air intake temperature was 25°C.
My ESXi nodes rebooted with the new build 16324942 and there were no errors in vLCM. But I could hear there’s somethin wrong. A fan running at speed over 8000 RPM will tell you there IS something to look after. Also the boot procedure took much longer than usual.
I quickly shut down the whole cluster in order to avoid a core meltdown.