Troubleshooting driver, firmware and ESXi version combinations
Hardware failures in vSphere clusters normally aren’t a big issue. Almost every component is redundant in one way or another. If one component fails, another one jumps in and takes over its function. A malfunction is a different thing and more serious than a failure. Such a zombie can become a real problem, because as long as there are signs of life, no replacement jumps in and there is no failover.
I witnessed such a situation after a scheduled reboot of a Top-of-Rack (ToR) switch. An ESXi host that was connected to the switch with a 10 Gbit uplink started to malfunction but didn’t fail.
As you can see in the picture, the link indicator and activity LEDs are lit although the cable has been disconnected, a sure sign that something is wrong.
The vSphere Distributed Switch (vDS) that uses the affected vmnic as an uplink was set to static port binding. That means each VM is statically assigned a dvPort, and these ports are spread across the available dvUplink ports. As a result, some VMs could still be reached over the LAN (vmnic1) while others were isolated (vmnic0). We would expect a failover in such a situation, but that was not the case. vmnic0 hadn’t entirely failed, so neither the VMs nor the vMotion kernel ports triggered a failover to vmnic1.
For the same reason it wasn’t possible to evacuate the host: multi-NIC vMotion sent packets over both vmnic0 and vmnic1, and all attempts to vMotion VMs to another host failed.
The only way to force the traffic onto the last remaining operational adapter was to remove vmnic0 from the vDS completely. After that I was able to vMotion the VMs and shut down the host for troubleshooting.
What had happened? The NIC’s strange behaviour led to the suspicion that the problem might be related to the combination of hardware, driver and firmware. Therefore we need more information about the vmnic in question.
esxcli network nic list
The network adapter is an Emulex OneConnect OCe14000. It’s a LoM (LAN on Motherboard) adapter, i.e. an OEM (Original Equipment Manufacturer) component with an Emulex chipset on the server vendor’s board.
Let’s get some more detailed information about vmnic0.
esxcli network nic get -n vmnic0
As we can see vmnic0 uses driver elxnet version 10.2.309.6v with firmware 11.2.1194.36.
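If you have to check more than one host, reading these values by eye gets tedious. The following is a minimal sketch that pulls driver name, driver version and firmware version out of the command output with awk. The function name nic_driver_info is made up for this example, and the sample text is an abridged reconstruction of the "Driver Info" section based on the values above, so treat the exact layout as an assumption and compare it with what your host prints.

```shell
#!/bin/sh
# Print "<driver> <driver version> <firmware version>" from the output of
# `esxcli network nic get -n <vmnic>`, read on stdin.
# nic_driver_info is a hypothetical helper name, not part of esxcli.
nic_driver_info() {
    awk -F': ' '
        /^ *Driver:/           { driver = $2 }
        /^ *Firmware Version:/ { firmware = $2 }
        /^ *Version:/          { version = $2 }
        END { print driver, version, firmware }
    '
}

# Abridged sample (assumed layout, values taken from the text above).
sample='   Driver Info:
         Driver: elxnet
         Firmware Version: 11.2.1194.36
         Version: 10.2.309.6v'

printf '%s\n' "$sample" | nic_driver_info
```

On a real host you would pipe the command itself into the function: esxcli network nic get -n vmnic0 | nic_driver_info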
Because it is OEM hardware we need a precise description to query VMware’s hardware compatibility list (HCL).
VID, DID, SVID and SSID
To identify a piece of hardware precisely you can search for “Vendor ID” (VID), “Device ID” (DID), “Subsystem Vendor ID” (SVID) and “Subsystem ID” (SSID). The VID indicates the chipset manufacturer (e.g. Intel, Emulex or QLogic). The DID describes the chipset. Server manufacturers often buy third-party chipsets and mount them on their own boards. Therefore we also need the subsystem vendor ID (SVID), which names the server manufacturer (e.g. Dell, Fujitsu, HPE). These manufacturers assign their own device IDs, the subsystem ID (SSID). With all four numbers at hand you can identify the device unambiguously.
vmkchdev -l | grep vmnic0
Vendor ID (VID) = 10df
Device ID (DID) = 0720
Subsystem Vendor ID (SVID) = 1734
Subsystem ID (SSID) = 120e
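The four IDs come as two colon-separated pairs in the second and third column of the vmkchdev output. As a small sketch, they can be split into labelled values like this. The function name parse_pci_ids is made up for this example, and the PCI address at the start of the sample line is a placeholder, since only the IDs are known from the output above.

```shell
#!/bin/sh
# Split one line of `vmkchdev -l` output into the four IDs the HCL asks for.
# Assumed column layout: <pci address> <VID:DID> <SVID:SSID> <owner> <alias>
# parse_pci_ids is a hypothetical helper name.
parse_pci_ids() {
    awk '{
        split($2, dev, ":")      # VID:DID
        split($3, subsys, ":")   # SVID:SSID
        printf "VID=%s DID=%s SVID=%s SSID=%s\n", dev[1], dev[2], subsys[1], subsys[2]
    }'
}

# Sample line: the PCI address 0000:02:00.0 is a placeholder,
# the IDs are the ones shown above.
echo "0000:02:00.0 10df:0720 1734:120e vmkernel vmnic0" | parse_pci_ids
```

On a host the input would come from the command itself: vmkchdev -l | grep vmnic0 | parse_pci_ids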
Now we can query VMware HCL by selecting the dropdown boxes within the red rectangle.
We now see that it’s an Emulex OCe14000-LoM labelled as a Fujitsu device. The adapter in question resides in a Fujitsu Primergy RX2540 M1 server. It has an adjustable slot for different connectors (SFP+, 10G-T, 1G-T).
HCL tells us that certain driver versions require distinct firmware versions. We can query installed driver information on the CLI.
esxcli software vib list | grep elxnet
elxnet                10.2.309.6v-1vmw.600.0.0.2494585  VMware  VMwareCertified  2018-10-01
emulex-esx-elxnetcli  10.2.309.6v-0.0.2494585           VMware  VMwareCertified  2018-10-01
We have seen in adapter details that we’re using firmware 11.2.1194.36. For this firmware version it’s recommended to use driver version 11.2.1149.0 (red box) and not the installed version 10.2.309.6v.
The driver / firmware combination in use (yellow) is not supported for ESXi 6.0 and has to be corrected.
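The comparison we just did by hand can be written down as a small check. This is only a sketch: the lookup table contains the single firmware and driver pair named above, and the function names required_driver_for and check_combo are invented for this example. For other firmware versions you would have to add the pairs from the HCL yourself.

```shell
#!/bin/sh
# Look up the driver version that belongs to a firmware version.
# Only the one pair discussed above is listed; extend it from the HCL.
required_driver_for() {
    case "$1" in
        11.2.1194.36) echo "11.2.1149.0" ;;
        *) return 1 ;;
    esac
}

# Compare the installed driver with the one required for the firmware.
# Returns 0 on match, 1 on mismatch, 2 if the firmware is not in the table.
check_combo() {
    installed_driver=$1
    firmware=$2
    required=$(required_driver_for "$firmware") || {
        echo "UNKNOWN: firmware $firmware is not in the local table"
        return 2
    }
    if [ "$installed_driver" = "$required" ]; then
        echo "OK: driver $installed_driver matches firmware $firmware"
    else
        echo "MISMATCH: firmware $firmware expects driver $required, found $installed_driver"
        return 1
    fi
}

# The combination found on the affected host.
check_combo "10.2.309.6v" "11.2.1194.36" || true
```

A plain string comparison is enough here, because the HCL lists exact version pairs rather than version ranges.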
The required driver is not included in standard ESXi 6.0 U3 and has to be installed separately. The corresponding download link can be accessed by expanding [+]. The driver can be installed either with Update Manager or on the CLI.
A look at the Fujitsu support matrix for Broadcom adapters (formerly Emulex) confirms the driver / firmware combination.
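For the CLI route, the usual sequence is maintenance mode, VIB installation from the offline bundle, reboot. The sketch below only prints these commands as a dry run so you can review them before typing them on the host. The function name print_install_steps and the bundle path are placeholders made up for this example; use the path of the offline bundle you actually downloaded and uploaded to a datastore.

```shell
#!/bin/sh
# Dry run: print the commands for a CLI driver installation
# instead of executing them. print_install_steps is a hypothetical helper.
print_install_steps() {
    bundle=$1
    case "$bundle" in
        /*) ;;  # esxcli wants the full path to the offline bundle
        *)  echo "error: bundle path must be absolute" >&2; return 1 ;;
    esac
    echo "esxcli system maintenanceMode set --enable true"
    echo "esxcli software vib install -d $bundle"
    echo "reboot"
}

# Placeholder datastore and file name, replace with your own.
print_install_steps "/vmfs/volumes/datastore1/elxnet-offline-bundle.zip"
```

After the reboot, the vib list command from above should report the new elxnet version.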
The following Broadcom CNA driver and firmware combinations have been tested and released on FUJITSU Server PRIMERGY Hardware with VMware ESXi versions 5.x and 6.x
Driver Firmware Matrix for ESXi 6.0
ESXi Version / Driver Name / Driver Version / Firmware Version
You need to keep an eye on hardware compatibility not only before a system upgrade, but also before a firmware update. New firmware is usually a good idea, because it fixes problems and brings enhancements, but the driver has to match the firmware.